Skip to content
IRI Logo
Solutions Products
  • Solutions
  • Products
  • Blog
  • BI
  • Big Data
  • DQ
  • ETL
  • IRI
    • IRI Business
    • IRI Workbench
  • Mask
  • MDM
    • Master Data Management
    • Metadata Management
  • Migrate
    • Data Migration
    • Sort Migration
  • Test Data
  • Transform
  • VLDB
  • VLOG

Using Data Templates to Find Data Format Errors

  • by Paul Friedland

address_formatsIntroduction

In Working towards Data Quality, we defined data quality (DQ) as a state in which data can be used for operations. What makes the quality of data high is the paucity of errors. We also outlined the responsibility of data architects, governance, and MDM teams to define, identify, and correct bad data.

In this article, we introduce the concept of data templates in IRI software, and show how they can be used to spot errors of form and format in data.


Why Automation Matters

Such errors would be obvious to anyone reading the data carefully. Of course, this is not a practical solution at the terabyte level.

What the DQ group needs are programs that look at data, in an application-independent manner, and identify records whose data are wrong on their face. As an example from the previous article, we mentioned the month of “Yraurbef,” which is always wrong; this article was not written on June 31; and my salary does not contain “abc”.

The program that does the checking can run at the same time output is being produced for regular production purposes; i.e., data integration / transformation, migration / replication, reporting / provisioning, and data masking. Given large amounts of data, and the likelihood of not knowing everything to predefine or verify, the prudent approach is to run the jobs iteratively until all the errors are identified and corrected, or otherwise noted.

Templates & Composites

A template is a structure used by SortCL programs (in the IRI CoSort package) to describe a particular format of a source and target data field, like dates, as well as any literal characters that always occur within it. The syntax and semantics of this structure are well known to C programmers. Readers who are not familiar with C can quickly learn the options. Templates are a way to define a new, custom data type, or master data format.

Using these canned structures, or library routines, also allows the I/O system to create and recognize data in a precise format for quality assessment. On input, values are extracted according to the template you define, and are transferred into SortCL internal values. On output, the internal values are written in the proper form specified by the output template. Whether reading or writing, if the data are not in the specified form, an error message is given.

Templates support simple and composite values. A composite value is one or more simple or composite values grouped together under a single name. The idea is to fix individual items relative to each other. The composite can also contain static values; i.e., character strings, special characters, and punctuation. Such static values allow the composite forms to conform to internal standards, and to be more familiar and understandable to end-users.

Templates can be constructed in the IRI Workbench GUI (built on Eclipse™) or scripted directly in SortCL programs with a text editor. Importantly, existing templates can be centralized and re-used to maintain data conformity between applications. For those taking an enterprise view, templates are a corporate resource to be shared – named, centrally maintained, and imported into applications as needed.


Template Use Examples – Finding Format Errors

Example 1. Finding Errors in Text and Dates

/TEMPLATE=(abc_dates = "abc %b/%e/%y")
/INFILE=dates
     /FIELD=(f1,STRUCTURE = abc_dates(month,day,year))
. . . .

The abc_dates template describes records that look precisely like this:

4 characters |abc |
3 character month name (Jan Feb ..) followed by a slash “/”
2 digit day of the month followed by a slash “/”
2 digit year 

dates (input):

abc Jan/29/13
aBc Dec/ 2/01
abc Jun/31/69
abc Ocp/12/71
Error Message                       Reason
@ dates + 2:02 'B'!= b'             abc_dates" 'B' not equal to 'b' at 2nd char
@ dates + 3 date [1969 6 31]        abc_dates" no such date
@ dates + 4:05 month |Ocp/12/71|    abc_dates" no such month

Note: Names of months (e.g., Jan, January …) and names of days (e.g., Tue, Tuesday …) are recognized and generated according to the LOCALE setting.

Example 2. Finding Errors in Numeric Representation

/TEMPLATE = (Bill_fmt = "%3i %9s %7.2f")
/INFILE = Purchases
    /FIELD = (Bill_fmt, STRUCTURE = Bill_fmt(Amt, Item, Price))
/REPORT
/OUTFILE= Results
    /FIELD = (Amt, POSITION = 5, SIZE = 5.0, TYPE = NUMERIC)
    /FIELD = (Item, POSITION = 12, SIZE = 9, TYPE = ASCII)
    /FIELD = (Price, POSITION = 21, SIZE = 10, TYPE = NUMERIC)
Purchases (input)      Results (output)
4 Carrots 3.29         4 Carrots 3.29
.2 Onions 1.98         0 Onions 1.98
10 Potatoes 3.26       10 Potatoes 3.26
1O Chickens 13.42      1 Chickens 13.42
10 Chickens 13,42      10 Chickens 13.00

Without looking at the next paragraph, the reader might want to try to explain why the output has bad results. The output looks right, and would be reported as such, or become part of any calculations. Fortunately, messages are being generated because the input data does not conform to the specified requirements of its template. These messages are displayed:

@ Purchases + 2:01 value | .2 | "Bill_fmt"
@ Purchases + 4:01 value | 1O | "Bill_fmt"
@ Purchases + 5:17 value |13,42| "Bill_fmt"

On Record 2, Amt may only be an integer of length 3; on record 4, the “0” in “10” is really the capital letter “O”; and on record 5, the decimal point is really a comma “,”. This is easy to see when the messages are given; but without them, the errors would not be found if they were nestled in a million records.


Next Article – Finding Context Errors

In our next discussion, we will talk about tools for discovering data that is application dependent –- data that looks possible but is wrong in context.

Topics include, finding:

  • membership or non-membership in a set of values
  • low_value ≤ actual value ≤ high value
  • data that is a function of other values (e.g., range ≠ distance * time)
Metadata Management: Asset Security in Git (Part 4 of 4)
CLF and ELF Web Log Formats
composite value composites data format errors data governance data quality master data format Master Data Management MDM templates

Related articles

The IRI Platform
IRI Data Governance
IRI Data Quality and Improvement
Automating IRI Jobs Using File…
SortCL-Compatible Excel Data Processing Examples
Processing Data in, and for,…
IRI Product Nomenclature & Architecture
Data Generation Rules in IRI…
The IRI-Windocks Test Database Repository
This image shows base endpoints of the DarkShield API for searching and masking text
IRI DarkShield RPC API
Connecting IBM DB2 with IRI…

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Categories

  • Big Data 66
  • Business Intelligence (BI) 77
  • Data Masking/Protection 163
  • Data Quality (DQ) 41
  • Data Transformation 94
  • ETL 122
  • IRI 229
    • IRI Business 86
    • IRI Workbench 162
  • MDM 37
    • Master Data Management 12
    • Metadata Management 25
  • Migration 65
    • Data Migration 60
    • Sort Migration 6
  • Test Data 102
  • VLDB 78
  • VLOG 40

Tracking

© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact