IRI Test Data Generation

by Daniel Howard, Bloor Research

We started this series of articles by talking about test design automation and the need to introduce automation throughout your testing processes. In this blog, we come full circle to talk, once again, about testing.

Now, however, we are going to discuss test data generation specifically. It forms a significant subset of test data management because it is integral to keeping your test data secure while retaining the characteristics that matter to your tests. Either you will want to find any sensitive data within the data you want to test and replace much or all of it with masked data, or you will want to generate entire sets of synthetic data (which is to say, data that is realistic but not real) for testing purposes. Or, quite possibly, both. Regardless, you will need some degree of test data generation capability. For generating masked data, see the previous article in this series, in which we discuss this topic at length. For synthetic data generation, read on, although note that, at least in IRI’s case, the two can quite readily be combined.

Synthetic data generation can be thought of either as an alternative to the traditional method of generating test data via subsetting and masking, or as an addition to it, in which it is used to generate the masked data that replaces your sensitive data. In either case, one of the most important things to understand about synthetic data generation is that it does not just generate random data. Rather, it uses sophisticated methods to analyze the structure of an existing data set, then produces a new data set composed of data that is entirely fake individually but that possesses the same statistical properties as the original data set when considered as a whole. This is sometimes referred to as “preserving statistical integrity” (in contrast to preserving referential integrity, which is vitally important for masking data consistently across relational databases while maintaining existing relational structures in the masked data). Thus, you end up with a selection of entirely safe data that cannot be used to identify an individual but is still just as useful as the original, sensitive data set for the purposes of testing. That is the ideal, anyway; the degree to which various vendor offerings actually achieve this varies considerably.
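To make “preserving statistical integrity” concrete, here is a minimal Python sketch. It is a generic illustration, not how RowGen or any other product works internally: the column names (`plan`, `spend`), the `SYN-` identifier prefix, and the choice of a normal distribution for the numeric column are all assumptions made for the example.

```python
import random
import statistics
from collections import Counter


def fit_profile(rows):
    """Summarize a data set: category frequencies plus numeric mean and stdev."""
    spends = [r["spend"] for r in rows]
    return {
        "plans": Counter(r["plan"] for r in rows),
        "spend_mean": statistics.mean(spends),
        "spend_stdev": statistics.stdev(spends),
    }


def synthesize(profile, n, seed=0):
    """Generate n fake rows whose aggregate statistics follow the profile."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    categories = list(profile["plans"].keys())
    weights = list(profile["plans"].values())
    rows = []
    for i in range(n):
        # Assumption: spend is roughly normal; clip at zero to stay plausible.
        spend = rng.gauss(profile["spend_mean"], profile["spend_stdev"])
        rows.append({
            "customer_id": f"SYN-{i:06d}",  # invented key, not a real customer
            "plan": rng.choices(categories, weights=weights, k=1)[0],
            "spend": round(max(0.0, spend), 2),
        })
    return rows
```

No synthetic row corresponds to a real one, but across many rows the share of each plan and the average spend track the original. A real tool would also need to capture correlations between columns, which this sketch ignores.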

IRI Voracity (or, more specifically, IRI RowGen) is available for generating “realistic but not real” synthetic data. It places particular emphasis on test data customization: on giving you fine-grained control over what data is generated and, moreover, how and where it is generated. For instance, at a basic level, it can either generate test data based on information provided to it or select data randomly from a “set file” that has been prepared ahead of time, either by hand or in IRI Workbench. These set files may themselves consist of synthetic data, or of real data that has been isolated from any associated data to the point that it is not identifying. Set files can be simple lists or have multiple columns. The general idea is that multiple set files can be drawn from simultaneously to create a holistic data profile for a person or other entity that doesn’t actually exist, but that has realistic attributes drawn from your actual data.
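The idea of drawing from several set files at once can be sketched in a few lines of Python. This is an illustration only: the tab-separated layout, the sample values, and the function names are assumptions for the example and should not be read as RowGen's actual set file format or API.

```python
import csv
import io
import random

# Hypothetical set files, held in memory to keep the example self-contained.
FIRST_NAMES = "Alice\nBruno\nChen\n"                 # a simple one-column list
CITY_STATE = "Austin\tTX\nDenver\tCO\nMiami\tFL\n"   # a two-column set


def load_set(text):
    """Parse a tab-separated set into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def make_person(rng, names, places):
    """Build one fictitious person by drawing a row from each set."""
    (first_name,) = rng.choice(names)
    city, state = rng.choice(places)  # drawn together, so they stay consistent
    return {"first_name": first_name, "city": city, "state": state}


rng = random.Random(42)
names = load_set(FIRST_NAMES)
places = load_set(CITY_STATE)
people = [make_person(rng, names, places) for _ in range(5)]
```

Keeping related attributes such as city and state in the same multi-column set means each generated person gets a combination that exists in the source, while independent attributes like first name are mixed freely.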

Various generation functions are available for creating test data sets, including both the specific – say, national ID number generation – and the generic – such as generating data according to a predefined, weighted statistical distribution. There are multiple ways to customize the end results of these functions: test data can be generated in such a way that each value is unique, each value in a set file can be mandated to be used exactly once, and so on. You can even define your own compound data formats.
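As a rough illustration of these options, the Python sketch below shows a weighted distribution, unique values, use-each-value-once selection, and a compound format. The function names and the `REGION-NNNNNN` format are hypothetical and chosen for the example; they are not RowGen functions.

```python
import random


def weighted_values(rng, weights, n):
    """Draw n values following a predefined weighted distribution."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=n)


def unique_ids(rng, n, low, high):
    """Draw n distinct integers from low..high (sampling without replacement)."""
    return rng.sample(range(low, high + 1), n)


def each_exactly_once(rng, values):
    """Return every value from a set exactly once, in random order."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def compound_id(region, number):
    """Assumed compound format: region code, hyphen, zero-padded number."""
    return f"{region}-{number:06d}"


rng = random.Random(0)
statuses = weighted_values(rng, {"active": 8, "closed": 2}, 1000)
ids = unique_ids(rng, 1000, 1, 999_999)
regions = each_exactly_once(rng, ["TX", "CO", "FL"])
labels = [compound_id(regions[i % len(regions)], ids[i]) for i in range(1000)]
```

Sampling without replacement guarantees uniqueness but fails if you ask for more values than the range contains, which is the kind of constraint worth checking before generating at volume.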

In short, whether synthetic test data is randomly generated or selected from set files, the characteristics of the production data it stands in for (including original data formats and sizes, value ranges, key relationships, and frequency distributions) are preserved. This degree of customization lets you tailor generated test data to your specific business needs. Moreover, it extends past what data you generate to how and where you generate it, which means that you could, for instance, generate data as part of a CI/CD pipeline. This adds further depth to the test data generation, and more specifically the test data customization, offered by RowGen in particular and Voracity in general.
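Of the characteristics listed above, key relationships are perhaps the easiest to get wrong when generating data by hand. The following Python sketch shows the basic idea under simple assumptions: the table names, column names, and amount range are invented for the example and do not come from any IRI product.

```python
import random


def generate_tables(n_customers, n_orders, seed=0):
    """Generate a parent and a child table whose foreign keys always resolve."""
    rng = random.Random(seed)  # a fixed seed gives repeatable output per run
    customers = [{"customer_id": i} for i in range(1, n_customers + 1)]
    keys = [c["customer_id"] for c in customers]
    orders = [
        {
            "order_id": j,
            # Draw the foreign key only from keys that exist in the parent.
            "customer_id": rng.choice(keys),
            "amount": round(rng.uniform(5, 500), 2),  # assumed value range
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_tables(n_customers=50, n_orders=200)
```

Because the output is determined by the seed, a script along these lines could run as a step in a CI/CD pipeline and hand every build the same test data.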


© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact