What is Hadoop?

  • by Jeff Simpson

Hadoop is an increasingly popular computing environment for distributed processing that businesses can use to store and analyze huge amounts of data. Some of the world’s largest and most data-intensive corporate users deploy Hadoop to consolidate, combine, and analyze big data in both structured and complex sources.

With Hadoop and its MapReduce programming model (and later engines like Spark, Storm, and Tez), high-volume data processing operations can scale from running on one server to several thousand machines at once, harnessing the computing power of a managed grid.
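
To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sales records are hypothetical, chosen only to show the three steps: map each record to key-value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    region, amount = record.split(",")
    yield region.strip(), float(amount)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical "region, amount" records, for illustration only.
    sales = ["east, 100", "west, 250", "east, 50"]
    print(run_job(sales))  # {'east': 150.0, 'west': 250.0}
```

Because each call to `map_phase` looks at one record only, and each call to `reduce_phase` looks at one key only, a framework is free to run those calls on different machines. That independence is what lets the same job scale from one server to many.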

Today, companies like Google, Yahoo, Facebook, eBay, and LinkedIn use Hadoop. For that reason, major industry vendors such as IBM, Oracle, Informatica, and Microsoft are positioning themselves on Hadoop, and long-time competing innovators like IRI (The CoSort Company) have as well. Both sides recognize that Hadoop is becoming a cost-effective way to work with petabytes of data.

What makes Hadoop more powerful than previous distributed processing technologies is that it can run on a large number of machines that do not share memory or disks. Hadoop breaks the data into smaller pieces, distributes those pieces across the grid, and merges the results automatically on the desired target platform. In addition, it has the intelligence to balance workloads and to recover from individual node failures through redundancy.
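
The split, distribute, and recover behavior described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop's actual placement or scheduling logic: the round-robin placement, the replication factor of 2, the node names, and the functions `split`, `place_chunks`, and `run` are all hypothetical.

```python
REPLICATION = 2  # assumed number of copies per chunk, for illustration only


def split(records, chunk_size):
    """Break the data into smaller pieces."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def place_chunks(num_chunks, nodes, replication=REPLICATION):
    """Assign each chunk to `replication` nodes, round-robin.

    The holders of a chunk are distinct only if replication <= len(nodes).
    """
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def run(records, nodes, failed=frozenset(), chunk_size=2):
    """Sum all records chunk by chunk, skipping nodes listed in `failed`."""
    chunks = split(records, chunk_size)
    placement = place_chunks(len(chunks), nodes)
    handled_by = {}
    partial_sums = []
    for chunk_id, chunk in enumerate(chunks):
        # Pick the first replica holder that is still alive.
        alive = [node for node in placement[chunk_id] if node not in failed]
        if not alive:
            raise RuntimeError(f"chunk {chunk_id} lost: all replicas failed")
        handled_by[chunk_id] = alive[0]
        partial_sums.append(sum(chunk))  # work notionally done on alive[0]
    # Merge the partial results into one answer.
    return sum(partial_sums), handled_by


if __name__ == "__main__":
    data = list(range(1, 11))
    total, handled_by = run(data, ["n1", "n2", "n3"], failed={"n2"})
    print(total, handled_by)
```

With one node down, every chunk still has a surviving copy, so the total is unchanged and only the node that handles each chunk differs. If every holder of some chunk fails, the sketch raises an error, which is the situation that a higher replication factor is meant to make unlikely.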

IRI has always been a big data vendor, processing data outside databases to improve performance and leverage standard file systems. The Hadoop Distributed File System (HDFS) is the applicable equivalent in this case. IRI began working with Hadoop innovators in S.E. Asia (Solusi247) in 2014 to distribute and optimize CoSort-compatible transformations and FieldShield data masking functions across large grids. RowGen-compatible test data generation is next.

By 2017, IRI’s modern platform for “total data management”, called Voracity, began running the above jobs either via the default SortCL engine or seamlessly in MapReduce 2, Spark, Spark Stream, and Tez. Support is also available for data streaming through Kafka, etc., compressed formats like Parquet, and both SQL and NoSQL databases compatible with Hadoop.

The results of IRI’s map-once-deploy-anywhere options are significant price-performance gains for big data integration (ETL) architects and data scientists, as well as data governance officers dealing with PII in JSON and other sources. That is not only because of the relatively low cost of Voracity subscriptions, but also because there is no need to learn to program in any language to get work done. The free IRI Workbench GUI, built on Eclipse, makes job design a graphical affair and makes coding in Hadoop moot.

Check out this article to help you decide when Hadoop should be used, and this article for how to connect to HDFS and run jobs seamlessly in Voracity.

3 COMMENTS
  • sindhu
    October 6, 2017 at 3:51 am

    This is really a very informative article on Hadoop, and it was easy to understand. Please keep updating your blog like this.

  • system design and architecture
    September 20, 2017 at 12:49 am

    excellent introduction

  • hadoop guy
    August 2, 2017 at 9:21 am

    Perfect article on what Hadoop is. I found many websites writing about it also, but their content is very difficult to read for beginners. IRI is awesome for making this blog definition about Hadoop so very clear.

    Regards
    Kuldeep

© 2025 Innovative Routines International (IRI), Inc., All Rights Reserved | Contact