Using CoSort to Speed Pentaho Sort Jobs
This article is the first in a 3-part series on using IRI products to expand functionality and improve performance in Pentaho systems. It demonstrates how to improve sorting performance; the articles that follow introduce ways to mask production data and create test data in the Pentaho Data Integration (PDI) environment.
Since 1978, IRI CoSort has been used to accelerate or replace third-party sort functions or sort process steps. CoSort is a standalone product, and it is also the default data manipulation engine in the IRI Voracity data management platform. Licensees of either product can run CoSort jobs in the free IRI Workbench GUI, on the command line, from Pentaho, etc.
Pentaho Data Integration (PDI) software includes a native sort that may not run fast enough for high-volume inputs. However, PDI process flows support the use of third-party functions, so data can be sorted externally without undue process disruption. Using PDI’s shell script step to call a CoSort job (e.g., a SortCL script) can reduce sorting times dramatically.
Pentaho and CoSort users can create a SortCL sort script in a text editor or via the new sort job wizard in the IRI Workbench GUI, built on Eclipse™. They must also create a batch file that tells Pentaho to run the CoSort command.
The CoSort job used for this article sorts a one-million-row CSV file (10.4 MB) on a 2.8 GHz Windows 8 PC using 2 of its 4 cores and 3 of its 12 GB of RAM.
In PDI, create a job that uses a Start step and a Shell step referencing the batch file created above. To run multiple sorts, add multiple SortCL commands to the same batch file referencing the various scripts.
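The script and batch file described above can be sketched with a small Python helper that generates both as text. This is only an illustration under stated assumptions, not the job used for the benchmarks: the file names, field names, and the helper functions `sortcl_script` and `batch_file` are hypothetical, and the SortCL statements (`/INFILE`, `/PROCESS=CSV`, `/FIELD`, `/SORT`, `/KEY`, `/OUTFILE`) and the `sortcl /SPECIFICATION=` invocation should be checked against the CoSort documentation for your version.

```python
def sortcl_script(infile, outfile, key_field, fields):
    """Return the text of a minimal SortCL job that sorts a CSV file on one field.

    The statement layout is an assumption based on common SortCL usage;
    verify it against the CoSort manual before running it.
    """
    if key_field not in fields:
        raise ValueError(f"key field {key_field!r} is not in the field list")

    # One /FIELD line per CSV column, numbered from 1.
    field_lines = [
        f'    /FIELD=({name}, POSITION={pos}, SEPARATOR=",")'
        for pos, name in enumerate(fields, start=1)
    ]
    lines = [
        f"/INFILE={infile}",
        "    /PROCESS=CSV",
        *field_lines,
        "/SORT",
        f"    /KEY=({key_field})",
        f"/OUTFILE={outfile}",
        "    /PROCESS=CSV",
        *field_lines,
    ]
    return "\n".join(lines) + "\n"


def batch_file(script_paths):
    """Return batch file text that runs one sortcl command per script, in order."""
    lines = ["@echo off"]
    lines += [f"sortcl /SPECIFICATION={path}" for path in script_paths]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file and field names, for illustration only.
    print(sortcl_script("input.csv", "sorted.csv", "id", ["id", "name", "amount"]))
    print(batch_file(["sort_by_id.scl", "sort_by_name.scl"]))
```

Saving the first string as a `.scl` file and the second as a `.bat` file gives the PDI Shell step a single file to reference. Listing several scripts in `batch_file` corresponds to adding multiple SortCL commands to the same batch file.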
Benchmarks show that using the Pentaho/CoSort hybrid is 14-16 times faster than using the native sort step in Pentaho alone. The chart below shows the number of seconds each method takes to sort CSV files of 1 million rows (10.4 MB), 25 million rows (238 MB), and 100 million rows (953 MB) on the PC described above.
At only 1 second, the one-million-row sort in CoSort was so fast that its timing did not display well on this graph. Tuning PDI to hold more than 1 million records in memory caused the native sort jobs to hang, which did not happen with CoSort.
When sorting millions of records, the difference easily adds up.
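Readers who want to run a similar comparison on their own hardware could time each method from the command line. The sketch below is a generic wall-clock harness, not the procedure used for the benchmarks above; the function names `time_command` and `speedup` are hypothetical, and the demo command is a stand-in for whatever command launches your batch file or PDI job.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in command: replace with the command that runs the sort being measured.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f} s")
```

Timing both methods on the same input file and the same machine, then passing the two results to `speedup`, yields a ratio comparable to the "times faster" figure quoted above.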
Beyond sorting, the CoSort SortCL program can perform a number of additional transformations at the same time, and can also cleanse, migrate, federate, protect, and report on data in disparate sources. Thus, even if you use Pentaho for many activities, you may find that offloading certain slower-running steps to CoSort is more efficient in high-volume circumstances.
The next article in this series describes a similar approach to masking production data in Pentaho to protect PII and comply with data privacy laws.