Finding Dark Data in Unstructured Sources

by Sharon Hewitt
Update April 2019: Added more unstructured file formats.

This is the first of a three-part blog series introducing IRI’s new data structuring technology. This article defines “dark data” and the unstructured sources IRI now supports. The second article shows how the Data Restructuring wizard works, and the third shows how the restructured data can be used by all IRI software products.

According to Gartner analyst Douglas Laney, “enterprise dark data” is “unutilized or underutilized information, collected generally for a single purpose — then forgotten or archived.” [1] Much of the dark data that corporations hold sits in unstructured data repositories.

What is unstructured (vs. structured) data? According to Wikipedia, unstructured data is “information that either does not have a pre-defined data model or is not organized in a pre-defined manner.” It is data that is not organized or classified in a way that can be easily grouped by subject; it’s mostly textual, but can also include images, audio, and video.

And let’s not forget social media. Facebook, Twitter, LinkedIn, and Pinterest, to name a few, all contain unstructured and semi-structured data. That data can be valuable to businesses large and small, but it needs to be structured before it becomes useful.

Structured data is, of course, the opposite of unstructured data. Webopedia defines structured data as “data that resides in a fixed field within a record or file.” It’s organized, and it relies on a model that determines how the data is stored, processed, and accessed. Structured Query Language (SQL) is often used to manage structured data in database tables, just as SortCL data definition files (DDF) in IRI CoSort define the layouts of external flat files.
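To make “a fixed field within a record” concrete, here is a minimal Python sketch of reading one fixed-width record. The layout, field names, and sample record are hypothetical and only for illustration; this is not SortCL DDF syntax or any IRI API.

```python
# Hypothetical fixed-width layout: (field name, start offset, length).
# In structured data, every record follows the same layout.
LAYOUT = [("id", 0, 5), ("name", 5, 12), ("state", 17, 2)]

def parse_record(line):
    """Slice one fixed-width record into named fields, trimming padding."""
    return {name: line[start:start + size].strip()
            for name, start, size in LAYOUT}

record = parse_record("00042Jane Smith  FL")
print(record)
```

Because the layout is known in advance, each value can be found by position alone. That is exactly what free-form text lacks.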

Semi-structured data is a cross between structured and unstructured data. It contains structured elements but doesn’t fit the formal models of relational databases or other sequential sources. Legacy (mainframe index) files are a good example of this hybrid, because they combine structured elements with proprietary layouts. Many XML files fall into this category, too, although there are also plenty of flat (structured) and free-form (unstructured) XML documents.

IRI software traditionally handled big data only in structured sources, i.e., flat files in many formats and relational database tables that are extracted or reached via ODBC. It can now also extract, structure, and process data in several semi-structured and unstructured sources, including:

Unstructured Files (using the Data Restructuring wizard in the IRI Workbench GUI, built on Eclipse™)

  • Free-form text (.txt)
  • Microsoft Word documents (.doc and .docx)
  • Adobe Portable Document Format (.pdf)
  • Extensible Markup Language (.xml)
  • E-mail messages (.eml)
  • Microsoft Excel spreadsheets (.xls and .xlsx)
  • Microsoft PowerPoint presentations (.ppt and .pptx)
  • Microsoft Exchange and Outlook (.osd and .pst)
  • Rich Text Format (.rtf)
  • Hypertext Markup Language files (.html)
  • JavaScript Object Notation files (.json)
  • Various image formats (.tiff, .jpeg, .png, .gif, .jp2, .jpx, .bmp)

Semi-structured Files

  • ASN.1 call detail record (CDR) files (via a CoSort / SortCL input procedure)
  • C-ISAM, IMS, QSAM, VSAM and other mainframe files (using partner ODBC drivers)
  • MF-ISAM and Vision index files (using embedded Micro Focus libraries)
  • MongoDB (JSON) and XML (using JDBC drivers in IRI Workbench)

All of this structuring is done by the Data Restructuring wizard, which is bundled with the Unstructured Data edition of the IRI NextForm data and database migration product. The general idea is that, after parsing the data in unstructured sources, you can output what you’re looking for to a structured text (flat) file whose layout is automatically defined in a data definition file (.DDF). The file and its metadata can then be used and re-used by IRI software, or fed to other applications, all within the same Eclipse IDE (the IRI Workbench), for:

  • Data Integration and Transformation
  • Data Migration and Replication
  • Data Masking (Encryption, De-ID, etc.)
  • DB Load and Query Optimization
  • Reporting or Hand-offs to BI Tools
  • Population of CRM, DB, ETL, and External Apps
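The general idea described above (search unstructured text for what you want, then write the hits to a structured flat file) can be sketched in a few lines of Python. This illustrates the concept only, not how the wizard is implemented. The e-mail pattern, the column names, and the `notes.txt` source name are all assumptions made for the example.

```python
import csv
import io
import re

# Assumed search pattern: a simple e-mail matcher, chosen only for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_matches(text, source_name):
    """Scan free-form text and return one structured row per match."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in EMAIL.finditer(line):
            rows.append({"source": source_name,
                         "line": line_no,
                         "value": match.group()})
    return rows

def to_flat_file(rows):
    """Serialize the rows as delimited text with a header describing the layout."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "line", "value"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = ("Call Ann at ann@example.com.\n"
          "No contact here.\n"
          "Cc: bob@example.org, eve@example.net")
rows = extract_matches(sample, "notes.txt")
flat = to_flat_file(rows)
print(flat)
```

Once the matches live in a delimited file with a known layout, any tool that expects structured input can sort, mask, load, or report on them.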

To see how to use the Data Restructuring wizard, read the next article, Using the Data Restructuring Wizard to Unlock Unstructured Data. To see how the newly structured output file and its DDF can be used in IRI software, read the blog Using CoSort on Restructured Data in the IRI Workbench.

 

[1] Gartner, “Answering Big Data’s 10 Biggest Vision and Strategy Questions,” Douglas Laney et al., August 12, 2014, p. 5.

3 COMMENTS
  • Data Class Validation in IRI Workbench - IRI
    November 8, 2019 at 5:50 pm

    […] this example provides a brief introduction to dark data discovery, you may find it useful to read this three part blog series that explores the feature in […]

  • Harinath Prabhakaran
    February 15, 2015 at 10:43 am

    Can one combine the data reconstructing wizard with the database migration edition of NextForm also? I’d want to move the results of my search into database tables so I can query and process it in that environment sometimes, too.

    1. Eric Leohner
      February 15, 2016 at 12:03 pm

      Yes, all the data discovery wizards are standard in the free IRI Workbench, and are thus available to all NextForm users, even users of the free Lite edition.

© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact