Data Quality - Catching Rule Violations

Data Quality – Catching Rule Violations

by Paul Friedland

Introduction
This is my third installment of blog articles about Data Quality. In the first article, I postulated that data has quality when it has an acceptable level of errors. The goal of course is no errors, but that might be too expensive to insure. Data governance, master data management (MDM), or another data architecture team charged with the governance of corporate data, needs to, among other things:

assess and monetize the cost of errors vs. the cost of detection;
define and limit data values within the information context; and,
employ the right tools and techniques to find errors or suspicious values.

In the previous article, we demonstrated some ways to catch errors that appear wrong on their face. Without knowing the application these data just looked wrong; e.g. alpha characters in a numeric field, impossible dates, missing special characters, and so forth. In this article, we show how to find errors in data that looks right, but contains values outside of an acceptable range for a specific application. This would take the forms of membership: 1) numerical values within a range; or, 2) discrete values that should or should not be present.

Planning
After finding real or possible errors, an additional responsibility is to discover the correct value and insert it into the data store. If that is not possible, those managing the data must report to the application groups about specific data that is not to be trusted. Before programming begins, there should exist an information diagram that associates data names with possible values, or, an acceptable range of values for each datum. Having rules provides two benefits to the organization:

a central repository of data names and rules for the data’s format and use;
a programmatic repository for the error-checking-software to follow.

The construction of this information requires knowledge and agreement of users about shared data. This rules library could be the only place where such knowledge is retained. The obvious place to begin is with errors that have been previously discovered. There would be no excuse for allowing the same error twice. Woe to the manager who discover errors after the production job is run when such errors could have been discovered through pre-processing. Then add the errors that might be anticipated.

Programming
Now comes the technical work of actually writing and running the data checking programs. This work is never finished because database contents are always changing, and sometimes the rules do, too. These programs must run in this kind of dynamic data environment, and whenever there are available machine cycles. The flow chart looks like this:

How IRI Does It
In the CoSort data management package — as well as its spin-off products for data migration, masking, and generation — IRI uses a 4^th generation language (4GL) called SortCL to define data layouts and manipulations. SortCL has constructs for searching files and tables, testing for conditions, combining transformations, and reporting to multiple targets.

SortCL jobs are defined in scripts written by hand or automatically generated by GUI wizards in an Eclipse-based IDE. The GUI, called IRI Workbench, also features online data discovery and metadata definition, along with job design, execution, and management systems. What makes the language and GUI particularly useful in finding errors is their ability to read from multiple data sources at once, apply mapping rules, and direct output to multiple targets.

Each target can have its own associated conditions, formats and contents. Whether or not a record contains an error, or is suspicious, its evaluation occurs through tests defined and assigned for each output. Time-wise, the output files or tables are typically small; incremental time differences as we specify additional target conditions are negligible.

In the following example, we are reading a Payroll file. In practice, the table would have thousands of entries and it would be too difficult to catch errors on sight. To automate error-checking, we have created look-up tables, or SET files, that contain only correct values for several fields. We also defined certain rules about jobs and ages. Finally, we inserted errors into the payroll to see if the program catches errors in membership (i.e. values out of range), and other rule violations.

Payroll

Benny, Jack     M   39    Comedian 88900
Black, Robert   M   54    JVP      25600
Carson, Johnie  M   68    Sculpter 67820
Einstein, Al    M   124   Genius   55000
Gleason, Jackie M   44    Hustler  45600
Green, Harry    M   102   Guard    31500
Grey, Damson    F   44    Clerk    101000
Heston, Charles M   44    Actor    45980
james, Henry    M   38    Groom    56789
Hoffa, Jimmy    M   66    Pilot    67000
Jones, John     M   37    Clerk    37000
Perry, Jane     F   56    Guard    98772
Shapiro, Edie   F   45    Welder   42123
Smith, Mary     F   42    CPA      62000
Stander, Gary   M   55    Guard    43567
White, Adam     M   45    VP       126000
White, Mary     F   06    Clerk    45478

Good_Users.set * Legitimate_Jobs.set

Black, Robert     Acct Clerk      35000
Carson, Kit       CPA             45000
Einstein, Al      Lab Guard       18500
Gleason, Jackie   JVP             42500
Green, Harry      Product Keeper  57000
Grey, Damson      Guard Genius    34567
Heston, Charles   Lot Pilot       75000
james, Henry      Stable VP       99000
Jones, John       Acct Welder     34500
Perry, Jane       Front
Shapiro, Edie     Shop
Smith, Mary       Acct
Stander, Gary     Factory
White, Adam       Office

CoSort SortCL Error Checking Script

/INFILE = Payroll # source file metadata
 /FIELD = (User, POSITION = 01, SIZE = 15)
 /FIELD = (Gender,POSITION = 17, SIZE = 1)
 /FIELD = (Age, POSITION = 21, SIZE = 3)
 /FIELD = (Job, POSITION = 26, SIZE = 10)
 /FIELD = (Salary, POSITION = 36, SIZE = 6)

/REPORT # sorting not req'd

/OUTFILE = Unrecognized_People # typo? fraud ?, . . )
 /FIELD = (User, POSITION = 01,SIZE = 15)
 /FIELD = (Check_user, POSITION = 20,SIZE = 12, SET = Good_Users.set[User] DEFAULT="Unknown User")
 /INCLUDE WHERE Check_user EQ "Unknown User" #

/OUTFILE = Invalid_Jobs # Jobs not on approved list
 /FIELD = (User, POSITION = 01,SIZE = 15)
 /FIELD = (Job, POSITION = 20,SIZE = 10) 
 /FIELD = (Job_check, POSITION = 35,SIZE  = 11,SET = Legitimate_Jobs.set[Job] DEFAULT="No such Job")
 /INCLUDE WHERE Job_check EQ "No such Job"
 
/OUTFILE = Age_Violations # Age related problems
 /FIELD= (User, POSITION = 01,SIZE = 15)
 /FIELD = (Age, POSITION = 20,SIZE = 5) 
 /FIELD = (Job, POSITION = 25,SIZE = 10)
 /INCLUDE WHERE Age < 18 OR Age > 100
 /INCLUDE WHERE Age > 65 AND Job EQ "Pilot "

/OUTFILE = Salary_Limits # something with salary
 /FIELD = (Job, POSITION = 5,SIZE = 10) 
 /FIELD = (User, POSITION = 15,SIZE = 15)
 /FIELD = (Salary, POSITION = 30, SIZE = 6)
 /INCLUDE WHERE Job EQ "Clerk " AND Salary GT "100000"
 /INCLUDE WHERE Job EQ "Guard " AND Salary GT " 45000"

Output Files:

Unrecognized_People

Benny, Jack        Unknown User
Carson, Johnie     Unknown User
Gleason, Jackie    Unknown User
Hoffa, Jimmy       Unknown User

Invalid_Jobs

Benny, Jack        Comedian       No such Job
Carson, Johnie     Sculpter       No such Job
Gleason, Jackie    Hustler        No such Job
Heston, Charles    Actor          No such Job
James, Henry       Groom          No such Job
Hoffa, Jimmy       Pilot          No such Job

Age_Violations

Einstein, Al       124   Genius
Green, Harry       102   Guard
Hoffa, Jimmy       66    Pilot
White, Mary        06    Clerk

Salary_Limits

Clerk     Grey, Damson   101000
Guard     Perry, Jane    98772

Database Profiling in IRI Workbench

IRI CellShield Secures Sensitive Data in Excel [video]

Data Quality – Catching Rule Violations

Related articles

Leave a Reply Cancel reply