Data Profiling: Discovering Data Details

by Dale Robson

Data profiling, or data discovery, refers to the process of obtaining information from, and descriptive statistics about, various sources of data. The purpose of data profiling is to better understand the content of the data, as well as its structure, relationships, and current levels of accuracy and integrity.
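
To make the idea concrete, here is a minimal column-profiling sketch in Python using pandas. The file name and its columns are hypothetical stand-ins for any structured source, and the script is an illustration of the concept rather than any IRI tool:

import pandas as pd

# Profile a delimited source: per-column type, completeness, cardinality, and range.
df = pd.read_csv("customers.csv")          # hypothetical source file

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),        # inferred data type
    "non_null": df.count(),                # populated values
    "nulls": df.isna().sum(),              # missing values
    "distinct": df.nunique(),              # cardinality
})
numeric = df.select_dtypes("number")
profile["min"] = numeric.min()             # numeric ranges expose outliers
profile["max"] = numeric.max()             # and placeholder values

print(profile)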

Data profiling may reveal errors in, or false conclusions around, metadata (data about data). Finding these problems early on helps improve the quality of source data prior to integrating or storing it in a data warehouse. Understanding the attributes of data in a database table or extracted file, and inspecting data values, helps validate that data content actually matches its metadata definition. Seeing the data and metadata also helps to identify which items are sensitive, or contain personally identifiable information (PII), so that certain columns can be flagged for protective measures. Data profiling thus discovers the characteristics of source data necessary for the identification, use, and lineage of data in integration, security, reporting, and other processes that follow.
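As an illustration of those two checks (does the content match its metadata definition, and which columns hold PII?), here is a hedged sketch in Python. The declared-type map, file name, and patterns are assumptions made for demonstration, not IRI metadata or APIs:

import re
import pandas as pd

DECLARED = {"cust_id": "int64", "balance": "float64", "email": "object"}   # from the metadata catalog
PII_PATTERNS = {
    "US SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email address": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
}

df = pd.read_csv("customers.csv")          # hypothetical source file

# Check 1: does each column's content match its declared (metadata) type?
for col, declared in DECLARED.items():
    inferred = str(df[col].dtype) if col in df.columns else "missing"
    if inferred != declared:
        print(f"{col}: declared {declared}, inferred {inferred}")

# Check 2: which columns look like PII and should be flagged for protection?
sample = df.astype(str).head(1000)
for col in sample.columns:
    for label, pattern in PII_PATTERNS.items():
        if sample[col].str.contains(pattern).mean() > 0.5:
            print(f"{col}: values resemble a {label}; flag for masking")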

Although collected data can often seem benign or useless, especially when gathered from multiple sources, keep in mind that any data may prove useful with the proper application or algorithm. Data profiling is thus also a first step in determining that usefulness, by improving understanding of the data itself.

Since many businesses ultimately rely upon raw data sources for insight into things like product inventories, client demographics, buying habits, and sales projections, a company’s ability to benefit competitively from ever-increasing data volumes can be directly proportional to its capacity to leverage those data assets. Winning or losing customers, and succeeding or failing as a business, could very well be determined by the specific knowledge an organization’s collected data imparts. Thus, identifying the right data, establishing its usefulness at the right level, and determining how to manage anomalies are all essential in the design of data warehousing operations and business intelligence applications.

According to Doug Vucevic and Wayne Yaddow, authors of Testing the Data Warehouse Practicum, “…the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.  The result of the analysis is used both strategically–to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, but tactically, to identify problems for later solution design, and to level sponsors’ expectations.”

Data authorities recommend performing data profiling randomly and repetitively on limited amounts of data, instead of trying to tackle large, complex volumes all at once. That way, each round of discoveries can determine what should be profiled next. Identifying data rules, restrictions, and prerequisites ensures the integrity of the metadata on which future profiling is performed. What is supposed to be in certain data files and what is actually there may not be the same thing. So whenever the quality or characteristics of a new source are unknown, experts suggest profiling the data first, before any integration into an existing system.
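
One way to follow that advice is to profile small random samples on each pass, letting the findings of one pass steer the next. The sketch below is a minimal illustration under assumed names (the file, chunk size, and sample size are all hypothetical); it reads a large delimited file in chunks so the whole volume never has to fit in memory:

import pandas as pd

SOURCE = "transactions.csv"    # hypothetical large delimited file
SAMPLE_ROWS = 5_000

for round_no in range(3):      # several small, quick passes instead of one big one
    chunks = pd.read_csv(SOURCE, chunksize=100_000)
    sample = pd.concat(chunk.sample(frac=0.05, random_state=round_no) for chunk in chunks)
    sample = sample.sample(n=min(SAMPLE_ROWS, len(sample)), random_state=round_no)

    # Null rates and cardinality per pass; anomalies here decide what to profile next.
    print(f"pass {round_no}: {len(sample)} rows sampled")
    print(sample.isna().mean().sort_values(ascending=False).head())
    print(sample.nunique().sort_values().head())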

Steps in the data profiling process include: importing all objects, creating configuration parameters, performing the actual profiling, and analyzing the results; none of which are as easy as they sound! Then, based upon the findings, schema and data corrections can be implemented, along with other fine-tuning to improve subsequent data profiling performance.
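
A hedged outline of those four steps as a small driver script appears below; the configuration file, its keys, and the threshold are hypothetical stand-ins for whatever a real profiling tool would manage:

import json
import pandas as pd

# Step 2: configuration parameters (source list, thresholds) kept outside the code.
with open("profiling_config.json") as f:     # hypothetical config file
    cfg = json.load(f)

# Step 1: import all objects named in the configuration.
sources = {name: pd.read_csv(path) for name, path in cfg["sources"].items()}

for name, df in sources.items():
    # Step 3: perform the actual profiling.
    stats = pd.DataFrame({
        "null_rate": df.isna().mean(),
        "distinct": df.nunique(),
    })
    # Step 4: analyze the results; the findings drive schema and data corrections
    # before the next, better-tuned profiling run.
    flagged = stats[stats["null_rate"] > cfg.get("max_null_rate", 0.1)]
    if not flagged.empty:
        print(f"{name}: columns over the null-rate threshold:\n{flagged}")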

IRI Profiling Tools

In mid-2015, IRI released a series of free database, structured, and unstructured (dark) data discovery tools in its Eclipse GUI, IRI Workbench. They are summarized at http://www.iri.com/products/workbench/discover-data, with links to other articles in this blog that go into more detail.

[Screenshot: the metadata discovery wizard in IRI Workbench, built on Eclipse. In this case, Oracle table structures are automatically parsed and generated for shared use in data definition file (DDF) repositories that support CoSort, FACT, NextForm, FieldShield, RowGen, BIRT, and other operations. See a demo video here.]