Skip to content
IRI Logo
Solutions Products
  • Solutions
  • Products
  • Blog
  • BI
  • Big Data
  • DQ
  • ETL
  • IRI
    • IRI Business
    • IRI Workbench
  • Mask
  • MDM
    • Master Data Management
    • Metadata Management
  • Migrate
    • Data Migration
    • Sort Migration
  • Test Data
  • Transform
  • VLDB
  • VLOG

Bucketing Data Values Using Set Files

  • by Don Purnhagen

In data warehouses, it is common to map discrete data values to a set of ranges. This makes it easier to write queries that involve a range of discrete values. Consider the following two SQL queries that both generate information about sales during a fiscal quarter:

SELECT MIN(QTY_SOLD), MAX(QTY_SOLD), SUM(QTY_SOLD) FROM SALE WHERE
        SALE_DATE >= TO_DATE('20110101', 'yyyymmdd') AND
        SALE_DATE < TO_DATE('20110401', 'yyyymmdd');

SELECT MIN(QTY_SOLD), MAX(QTY_SOLD), SUM(QTY_SOLD) FROM SALE WHERE
        SALE_QTR == '2011Q1';

You can see that the second query has a simpler WHERE clause. When working with complex join queries, involving many dimension tables, the complexity of the first approach increases with each discrete value range that must be filtered.

When data is moved from a production system to a data warehouse employing data ranges, the discrete data values must be replaced with identifiers that will map them into the proper range, or bucket. When using the IRI Voracity ETL platform, or a component IRI product such as CoSort, NextForm, FieldShield, or RowGen, to move data into a data warehouse, there are a couple of techniques that can be used to assign data values to a particular bucket.

For this example[1], we will use a table of items, which includes the item price. When we load the dimension table for the items in the data warehouse, we want to assign the items to one of three price ranges. This is a simplistic, and somewhat contrived example, but the concept will extend to more realistic use cases.

Below, ITEM.ITEM_PRICE is a column which contains a number in the range of 0.01 to 99.99. We want to group the items into buckets by price range. The price range dimension table looks like this:

Table: ITEM_PRICE_RANGE
PRICE_RANGE PRICE_START PRICE_END
LOW 00.01 19.99
MID 20.00 69.99
HIGH 70.00 99.99

The simplest way to assign the items into the three buckets, as described above, would use an IF-THEN-ELSE statement in the output section of your (SortCL) job (script). The output field for PRICE_RANGE could contain:

/FIELD=(PRICE_RANGE, IF PRICE <= 19.99 THEN "LOW" \
        ELSE IF PRICE <= 69.99 THEN "MID" \
        ELSE "HIGH"

When the range of discrete values is greater, and there are many more buckets however, this approach gets unwieldy. A different approach using set files can simplify your job scripts.

A set file for IRI products is a plain text file with data values on each line. A simple set file contains just one data value per line, and is often used to select random, realistic looking values when generating test data with IRI RowGen. There can also be set files with multiple, related data values on each line, separated by a tab character (ASCII 9).

For the purpose of assigning bucket identifiers in this case, we will use a set file where the end value of the range is the first value, and the name of the range is the second value. Consider this file, that we saved with the filename price-range.set:

19.99        LOW
69.99        MID
99.99        HIGH

The use of this technique requires a much shorter, and simpler field statement in the IRI job script. Furthermore, a change in the script is not necessary to change the bucket start or end points.

To change the range for a bucket, just edit the set file values. You can even add additional buckets without any further modification of the job script field statement.

The field statement to use a set file as described above looks like this:

/FIELD=(BUCKET, TYPE=ASCII, POSITION=2, SEPARATOR="\t", \
        SET="price-range.set"[PRICE] DEFAULT="*" SEARCH=GE)

If we wanted to increase the number of buckets to five in this example, we could edit the price-range.set file to look like this:

09.99        LOW
24.99        MIDLOW
49.99        MID
69.99        MIDHIGH
99.99        HIGH

We could run the IRI job script again with the new set file, and the bucket values would change as the new set file is used.

You can even use an IRI job script to build the set file from the range table. Here is a sample script that will extract the range end and index from our sample table:

# Generated with the IRI Workbench - New Reformat Job
# Extract the ID and END value for a lookup set to select buckets.
/INFILE="SCOTT.ITEM_PRICE_RANGE;DSN=Oracle;"
    /PROCESS=ODBC
    /FIELD=(PRICE_RANGE, TYPE=ASCII, POSITION=1, SEPARATOR="\t")
    /FIELD=(PRICE_END, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
/REPORT
/OUTFILE=price-range.set
    /FIELD=(PRICE_END, TYPE=NUMERIC, POSITION=1, SEPARATOR="\t", \
        SIZE=5, PRECISION=2, FILL='0')
    /FIELD=(PRICE_RANGE, TYPE=ASCII, POSITION=2, SEPARATOR="\t")

To put it all together, here is an IRI job script that would read the item values from a relational table, and load them into a dimension table. It also converts the price into a bucket based on the range values we extracted from the ITEM_PRICE_RANGE table:

# Generated with the IRI Workbench - New Sort Job
# Load item dimension table, converting price into a bucket value.
/INFILE="ITEM;DSN=Oracle;"
/PROCESS=ODBC
    /FIELD=(ITEM_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR="\t")
    /FIELD=(ITEM_NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
    /FIELD=(PRICE, TYPE=NUMERIC, POSITION=3, SEPARATOR="\t")
    /FIELD=(CATEGORY_ID, TYPE=NUMERIC, POSITION=4, SEPARATOR="\t")
/SORT
    /KEY=(PRICE)
/OUTFILE="ITEM_DIM;DSN=Oracle;"
    /PROCESS=ODBC
    /FIELD=(ITEM_ID, TYPE=ASCII, POSITION=1, SEPARATOR="\t")
    /FIELD=(ITEM_NAME, TYPE=ASCII, POSITION=2, SEPARATOR="\t")
    /FIELD=(PRICE_BUCKET, TYPE=ASCII, POSITION=3, SEPARATOR="\t", \
        SET="price-range.set"[PRICE] DEFAULT="*" SEARCH=GE)
    /FIELD=(CATEGORY_ID, TYPE=ASCII, POSITION=4, SEPARATOR="\t")

[1] This example uses IRI job scripts to move and transform the data values. For a primer on such scripts, see the SortCL program chapter in this CoSort technical overview booklet.

 

UUID Support in IRI Products
Drawing Values from Set Files
bucket data value data warehouse dimension table IRI CoSort IRI NextForm IRI RowGen job script SortCL SQL queries

Related articles

Connecting MariaDB and MySQL to…
Running IRI Software in a…
The IRI Platform
IRI Test Data Generation
IRI Data Quality and Improvement
IRI Data Migration and Modernization
IRI Voracity and Test Design…
Creating Set Files in IRI…
All About IRI Set Files:…
Real-time Database Data Replication
Getting Started with IRI Ripcurrent
1 COMMENT
  • Anonymizing Indirect Identifiers to Lower Re-ID Risk - IRI
    August 8, 2018 at 1:40 pm
    Reply

    […] this example, and building upon the article here, we will show how FieldShield or Voracity users can use their free IRI Workbench graphical jog […]

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Categories

  • Big Data 66
  • Business Intelligence (BI) 77
  • Data Masking/Protection 163
  • Data Quality (DQ) 41
  • Data Transformation 94
  • ETL 122
  • IRI 229
    • IRI Business 86
    • IRI Workbench 162
  • MDM 37
    • Master Data Management 12
    • Metadata Management 25
  • Migration 65
    • Data Migration 60
    • Sort Migration 6
  • Test Data 102
  • VLDB 78
  • VLOG 40

Tracking

© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact