
Informatica recently introduced HParser, an expansion of its capabilities for working with Hadoop data sources. Beginning with Version 9.1, introduced earlier this year, Informatica’s flagship product has been able to access data stored in HDFS as either a source or a target for information management processes. However, it could not manipulate or transform the data within the Hadoop environment. With this announcement, Informatica starts to bring its data transformation capabilities to Hadoop.

We recently completed benchmark research on Hadoop, the open source large-scale processing technology, and I have been writing regularly about Hadoop in this blog. Hadoop has quickly become a popular technology for storing and analyzing big data; more than half (54%) of participants in our research are either using or evaluating it. The research shows that they most often use Hadoop in conjunction with unstructured data, including application logs and event data. Before anyone can analyze this data, it must be parsed to identify the individual pieces of information recorded in each row of the log files.
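To make the parsing step concrete, here is a minimal sketch in Python of turning one log row into named fields. The log layout (the widely used Common Log Format), the sample line, and the names `LOG_PATTERN` and `parse_log_line` are assumptions for illustration only; they are not taken from our research or from any Informatica product.

```python
import re

# Assumed layout (Common Log Format):
#   host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log row, or None if the row does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no bytes were returned.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample row used only to exercise the parser.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

Even for a simple format like this one, the pattern has to be written and maintained by hand, which is the kind of work a parsing tool aims to reduce.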

Informatica’s HParser is designed to make this process easier. Using DT Studio, Informatica’s Eclipse-based integrated development environment (IDE), organizations can use a graphical user interface to create data transformation routines that parse the information in log files and other types of data typically processed with Hadoop. Once developed, these routines are deployed to the Hadoop cluster and invoked as part of MapReduce scripts, which enables them to use the full distributed processing and parallel execution capabilities of Hadoop. Using a graphical environment to develop these routines should make it easier and faster to create the code necessary to parse the data. As our research shows, staffing and training are the two biggest obstacles to leveraging Hadoop, so tools like HParser that minimize the specialized skills required can be valuable to organizations deploying Hadoop.
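For readers unfamiliar with where a parsing routine sits in a MapReduce job, the following Python sketch simulates the pattern locally: a map step parses each row and emits a key/value pair, and a reduce step aggregates by key. This is hand-written illustration code, not HParser’s interface, which is not described here; the function names, the regular expression, and the choice of counting HTTP status codes are all assumptions.

```python
import re
from itertools import groupby

# Assumed: the status code follows the closing quote of the request field.
STATUS_RE = re.compile(r'" (\d{3}) ')


def mapper(lines):
    """Map step: parse each row and emit (status_code, 1); skip rows that do not parse."""
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            yield match.group(1), 1


def reducer(pairs):
    """Reduce step: sum the counts for each key.

    Hadoop hands the reducer its pairs grouped by key; sorted() mimics that locally.
    """
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Run the map and reduce steps in-process, without a cluster."""
    return dict(reducer(mapper(lines)))
```

On a real cluster the map step runs in parallel across many nodes, each working on its own slice of the data. That is why a parsing routine invoked inside the map step scales with the cluster.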

Informatica is making two versions of HParser available. The community edition is free, but it’s not open source. It can be used to process log files, Omniture Web analytics data, XML documents and the JavaScript Object Notation (JSON) data interchange format. The enterprise edition also supports a number of industry-standard data formats, including SWIFT, X12 and NACHA for the financial industry, HL7 and HIPAA for healthcare, and ASN.1 for telecommunications, as well as documents in PDF, XLS or Microsoft Word formats. For the most part, the enterprise offering is targeted at those in the Informatica user base who might be extending their efforts into Hadoop. The community edition may provide enough value for customers not currently working with Informatica to consider trying some of the company’s other products.

We’ve seen other information management vendors take a similar approach. Earlier this year Syncsort announced a free version of its sort routines for the Hadoop market as well as an enterprise edition. HParser appears to be part of a bigger effort on the part of Informatica to embrace Hadoop. The company has been conducting a series of webcasts called Hadoop Tuesdays, one of which I participated in last week, to help educate the market about Hadoop. You may find these useful if you want to learn more about Hadoop. They are not product sales pitches but are focused on explaining the technology and its uses. In addition, Informatica will be delivering a keynote presentation at Hadoop World next week.

We’ve seen business intelligence vendors and information management vendors alike embrace Hadoop. I expect we’ll continue to see more investment from Informatica and others as organizations work to make Hadoop a disciplined part of their IT infrastructure processes. As our research shows, integration is one of the top four issues for organizations working with Hadoop. The more that existing products can be extended to incorporate Hadoop or new products can be developed to make Hadoop easier to use, the more widespread its usage will become.

Die-hard MapReduce programmers may not feel that they need HParser. However, enterprise IT organizations already using Informatica should find it a welcome addition in their efforts to deal with Hadoop-based data sources. You can give it a try for yourself here.


David Menninger – VP & Research Director

Informatica has announced version 9.1 for Big Data. I wrote previously about Informatica 9.1, the latest iteration of the company’s data integration platform, following its industry analyst summit. At that event in February, company officials alluded to future plans regarding Hadoop and other big-data sources that had yet to be finalized. This announcement reveals those plans. Informatica will support three types of “big data”: big transaction data from relational databases and data warehouse systems; big interaction data from social media, customer interaction systems and other systems; and big data processing, which means Hadoop, the open source software framework. Let’s look at each of these types.

With respect to relational databases, Informatica adds support for additional analytic databases. Its PowerCenter connectors are now available for “traditional” databases including IBM DB2, Microsoft SQL Server, Oracle and Sybase as well as for analytical databases and data warehouse systems from Aster Data, Greenplum, Netezza, ParAccel, Teradata and Vertica. While Hadoop gets a lot of attention these days, it’s important to recognize that big data also exists in these other sources. Many of the customers of these vendors probably use Informatica already and will benefit from having official support for their configurations.

Social media and other customer interaction data are important sources for companies seeking to build a complete view of the customer. My colleague Richard Snow has written about the role of social media in this context, and our firm has conducted benchmark research on other customer interaction technologies. With version 9.1, Informatica makes it easier to collect social media data and includes specific connectors for Facebook, LinkedIn and Twitter.  

Informatica’s developments around big-data processing and Hadoop will come in two phases. The first phase, which the company said will be “shipping soon,” provides access to data stored in HDFS as both a target and a source for Informatica processes. A second phase in a future release will provide graphical codeless development of Hadoop MapReduce jobs, which will support preparing and integrating data in Hadoop. While phase one begins to incorporate Hadoop, the additional features of phase two are necessary to make Hadoop a first-class citizen in the Informatica ecosystem. Smaller, more nimble vendors such as Karmasphere are offering graphical development capabilities today, and Informatica will need to offer these as well to compete.    

As part of the launch, Informatica enlisted Tim Leonard, chief technology officer of U.S. Xpress, to talk publicly about its use of Informatica. This transportation company has an innovative application combining large amounts of streaming real-time data, location intelligence and mobile devices. The application enables U.S. Xpress to combine driver location and other data to reduce fuel consumption costs as well as provide better customer service through more detailed information about delivery schedules and the ability to reroute deliveries when necessary.    

So although Informatica is moving more slowly than some smaller vendors on particular features such as graphical development of Hadoop jobs, the U.S. Xpress application provides an example of the value of working with a vendor that has such an extensive portfolio of products. That customer is able to source from a single vendor data integration capabilities to handle big data, streaming data and location-based data. This is a promising position for Informatica. 


David Menninger – VP & Research Director
