Using CoSort to Speed Pentaho Sort Jobs

  • by Claudia Irvine

This article is the first in a three-part series on using IRI products to expand functionality and improve performance in Pentaho systems. It demonstrates how to improve sorting performance; the later articles introduce ways to mask production data and create test data in the Pentaho Data Integration (PDI) environment.

Since 1978, IRI CoSort has been used to accelerate or replace third-party sort functions and sort process steps. CoSort is a standalone product, and it is also the default data manipulation engine in the IRI Voracity data management platform. Licensees of either product can run CoSort jobs in the free IRI Workbench GUI, on the command line, from Pentaho, etc.

Pentaho Data Integration (PDI) software includes a native sort that may not run fast enough for high-volume inputs. However, PDI process flows support the use of third-party functions, so data can be sorted externally without undue process disruption. By using PDI’s shell script step to call a CoSort job (e.g., a SortCL script), sorting times can be reduced dramatically.

Pentaho and CoSort users can create a SortCL sort script in a text editor or via the new sort job wizard in the IRI Workbench GUI, built on Eclipse™. They must also create a batch file to tell Pentaho to run the CoSort command:

[Image: pentaho-sort-1, the SortCL sort script and the batch file that runs it]

This CoSort job sorts a one-million-row CSV file (10.4 MB) on a 2.8 GHz Windows 8 PC using 2 of its 4 cores and 3 of its 12 GB of RAM.

In PDI, create a job that uses a Start step and a Shell step referencing the batch file created above. To run multiple sorts, add multiple SortCL commands to the same batch file referencing the various scripts.
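
The batch file described above can also be generated rather than typed by hand. The following is a minimal sketch, not the author's method: the function name `build_sort_batch` and the script names are hypothetical, and the `sortcl /SPECIFICATION=<script>` invocation form is an assumption that should be checked against your CoSort documentation.

```python
def build_sort_batch(script_paths, sortcl_exe="sortcl"):
    """Return the text of a Windows batch file that runs each SortCL script in turn.

    Assumption: a SortCL job script is launched as
    `sortcl /SPECIFICATION=<script>`; verify this against your CoSort manual.
    Paths containing spaces would need quoting, which this sketch does not do.
    """
    lines = ["@echo off"]
    for script in script_paths:
        # One SortCL command per job script, executed in the order given.
        lines.append(f"{sortcl_exe} /SPECIFICATION={script}")
    # Batch files conventionally use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical script names, used for illustration only.
    print(build_sort_batch(["sort_customers.scl", "sort_orders.scl"]), end="")
```

Writing the returned text to a `.bat` file and pointing the PDI Shell step at that file keeps the list of sorts in one place, so adding a sort means adding one script name.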

[Images: pentaho-sort-2 and pentaho-sort-3, the PDI job with its Start and Shell steps]

Benchmarks show that the Pentaho/CoSort hybrid is 14 to 16 times faster than the native sort step in Pentaho alone. The chart below shows the number of seconds each method takes to sort CSV files of 1 million rows (10.4 MB), 25 million rows (238 MB), and 100 million rows (953 MB) on the PC described above.

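[Chart: pentaho-sort-4, sort times in seconds for 1 million, 25 million, and 100 million rows, PDI native sort vs. CoSort]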

At only 1 second, the one-million-row sort in CoSort was so fast that its timing does not display well on this graph. Tuning PDI to hold more than 1 million records in memory caused the native sort jobs to hang; CoSort had no such problem.

When you sort millions of records, the difference adds up quickly.
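
To reproduce this kind of comparison on your own data, you can time each job from a small wrapper. The sketch below is an assumption about how such a measurement might be done, not the harness used for the benchmarks above; `time_command` and `speedup` are hypothetical helper names, and the stand-in command should be replaced with your own batch file or PDI job launcher.

```python
import subprocess
import sys
import time


def time_command(command):
    """Run a command and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate ran than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; substitute the command
    # that launches your native PDI sort or your CoSort batch file.
    elapsed, code = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, elapsed {elapsed:.3f} s")
```

Timing both methods against the same input file on the same machine, and repeating each run a few times, gives a fairer ratio than a single measurement.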

Beyond sorting, the CoSort SortCL program can perform a number of additional transformations at the same time, and can also cleanse, migrate, federate, protect, and report on data in disparate sources. Thus, even if you use Pentaho for many activities, you may find that offloading certain slower-running steps to CoSort is more efficient in high-volume circumstances.

For a similar approach to masking production data in Pentaho to protect PII and comply with data privacy laws, see the companion article, Masking Data in Pentaho.

© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact