
Cloudera is riding the wave of big data. I first learned about the company while working at Vertica, one of Cloudera’s partners. Customers that managed large amounts of structured relational data also needed to process large amounts of semistructured data such as the type found in web logs and application logs. The emerging channel of social media provided another source of data lacking the structure that would lend itself to analysis in a relational database. Other organizations needed to perform calculations and analyses that were difficult to express in SQL. Seeing this market need, Cloudera recognized earlier than others an opportunity to leverage the Apache Hadoop project; it has been offering the Cloudera Distribution for Hadoop (CDH) since early 2009.

I first wrote about Cloudera last year after attending Hadoop World and seeing firsthand significant interest in Hadoop. Much has happened at Cloudera since then and also in the broader big-data market. Cloudera recently made CDH version 3 generally available. (My colleague Mark Smith wrote about CDH3 when it was first announced.) Cloudera says it intends to release additional distributions annually, so we should expect another release early to middle 2012, although the recent entry of competitors into the Hadoop distribution market might prompt Cloudera to accelerate its releases.

In addition to the open source CDH releases, Cloudera offers an enterprise product that combines CDH with support and a set of management applications for authorization, provisioning, monitoring and resource management. The company has been working on version 3.5 of Cloudera Enterprise and proposes a release cycle for the enterprise product about twice as often as the annual releases of CDH. Version 3.5 includes real-time activity monitoring, an expanded file browser to show how files are used and their ownership, and extended authorization management and administration.

Perhaps as significant as the software developments, Cloudera has solidified its place in the market with key customer wins, additional funding, an expanded executive team and new partnerships. Last October, Cloudera announced $25 million in funding. Its partnership with Informatica, announced last fall, has borne fruit as part of Informatica 9.1, which I covered in a previous post. I’ve also covered Jaspersoft Version 4, whose features include support for Hadoop. In my opinion, these partners are pursuing Cloudera rather than the other way around.

Of course, success often provokes competition. Cloudera’s first-mover advantage in the Hadoop market has attracted attention in the form of alternatives, both direct, such as EMC offering its own distribution of Hadoop, and indirect, such as LexisNexis offering an open source version of its high-performance cluster computing system.

We recently completed research on the market requirements around big data, the benefits of adopting one of these alternatives and the obstacles as well. This research, the first of its kind, is the largest, most comprehensive study of issues related to the big-data market. We’ll be sharing some of our preliminary findings in a webinar next week hosted by two of the research sponsors. Time will tell which of these alternatives will succeed. As I’ve expressed in previous posts, I like competition, and you should, too, because it spurs vendors to offer better products at lower prices.

Regards,

David Menninger – VP & Research Director

Last week I attended the IBM Big Data Symposium at the Watson Research Center in Yorktown Heights, N.Y. The event was held in the auditorium where the recent Jeopardy shows featuring the computer called Watson took place and which still features the set used for the show, a fitting environment for IBM to put on another sort of “show” involving fast processing of lots of data. The same technology featured prominently in IBM’s big-data message, and the event was an orchestrated presentation more like a TV show than a news conference. Although it announced very little news at the event, IBM did make one very important statement: The company will not produce its own distribution of Hadoop, the open source distributed computing technology that enables organizations to process very large amounts of data quickly. Instead it will rely on and throw its weight behind the Apache Hadoop project, in stark contrast to EMC’s decision, announced earlier in the week, to offer its own distribution. As an indication of IBM’s approach, Anant Jhingran, vice president and CTO for information management, commented, “We have got to avoid forking. It’s a death knell for emerging capabilities.”

The event brought together organizations presenting interesting and diverse use cases. These ranged from traditional big-data stories from Web businesses such as Yahoo to less well-known scenarios: informatics in life sciences and healthcare, presented by Illumina and the University of Ontario Institute of Technology (UOIT), respectively; low-latency financial services, by eZly; and customer demographic data, by Acxiom.

Eric Baldeschwieler, vice president of Hadoop development at Yahoo, shared some impressive statistics about its Hadoop usage, one of the largest in the world with over 40,000 servers. Yahoo manages 170 petabytes of data with Hadoop and runs more than 5 million Hadoop jobs every month. The models it uses to help prevent spam and others that do ad-targeting are in some cases retrained every five minutes to ensure they are based on up-to-date content. As a point of reference, CPU utilization on Yahoo’s Hadoop computing resources averages greater than 30% and at its best is greater than 80%. It appears from these figures that the Hadoop clusters are configured with enough spare capacity to handle spikes in demand.
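
To put those figures in perspective, here is a rough back-of-envelope calculation. It is only a sketch: the decimal units (1 PB = 1,000 TB) and the 30-day month are my assumptions, the function names are purely illustrative, and the averages ignore replication and any differences between clusters.

```python
# Back-of-envelope arithmetic on the Yahoo figures quoted above.
# Assumptions (not from the talk): decimal units (1 PB = 1,000 TB)
# and a 30-day month.

SERVERS = 40_000            # "over 40,000 servers"
DATA_PB = 170               # petabytes managed with Hadoop
JOBS_PER_MONTH = 5_000_000  # "more than 5 million" jobs, so a lower bound


def tb_per_server(data_pb: float, servers: int) -> float:
    """Average terabytes per server, ignoring replication and overhead."""
    return data_pb * 1_000 / servers


def jobs_per_minute(jobs_per_month: int, days_in_month: int = 30) -> float:
    """Average job rate per minute over an assumed month length."""
    return jobs_per_month / (days_in_month * 24 * 60)


if __name__ == "__main__":
    print(f"{tb_per_server(DATA_PB, SERVERS):.2f} TB per server")
    print(f"{jobs_per_minute(JOBS_PER_MONTH):.0f} jobs per minute")
```

Under these assumptions the quoted numbers work out to a little over 4 TB of managed data per server and more than 100 jobs started every minute, around the clock.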

During the discussions, I detected a bit of a debate about who is the driving force behind Hadoop. According to Baldeschwieler, Yahoo has contributed 70% of the Apache Hadoop project code, but on April 12, Cloudera claimed in a press release, “Cloudera leads or is among the top three code contributors on the most important Apache Hadoop and Hadoop-related projects in the world, including Hadoop, HDFS, MapReduce, HBase, Zookeeper, Oozie, Hive, Sqoop, Flume, and Hue, among others.” Perhaps Yahoo wants to reestablish its credentials as it mulls whether to spin out its Hadoop software unit. If such a spinoff were to occur, it could further fracture the Hadoop market.

I found it interesting that the customers IBM brought to the event, while having interesting use cases, were not necessarily leveraging IBM products in their applications. This fact led me to the initial conclusion that the event was more of a show than a news conference. Reflecting further on IBM’s stated direction of supporting the Apache Hadoop distribution, I wondered what IBM Hadoop-related products they would use. IBM will be announcing version 1.1 of InfoSphere BigInsights in both a free basic edition and an enterprise edition. The product includes BigSheets, which can integrate large amounts of unstructured Web data. InfoSphere Streams 2.0, announced in April, adds support for Netezza TwinFin, Microsoft SQL Server and MySQL to the SQL sources already supported. But this event was not about those products. It was about IBM’s presence in and knowledge of the big-data marketplace. Executives did say that the IBM product portfolio will be extended “in all the places you would expect” to support big data but offered few specifics.

IBM emphasized the combination of streaming data, via InfoSphere Streams, and big data more than other big-data vendors do. The company painted a context of “three V’s” (volume, velocity and variety) of data, which attendees, Twitter followers and eventually the IBM presenters expanded to include a fourth V, validity. To illustrate the potential value of combining streaming data and big data, Dr. Carolyn McGregor, chair in health informatics at UOIT, shared how the institute is literally saving lives in neonatal intensive care units by monitoring and analyzing neonatal data in real time.

Rob Thomas, IBM vice president of business development for information management, explained the role of partners in the IBM big-data ecosystem. As stated above, IBM will rely on Apache Hadoop as the foundation of its work, but will partner with vendors further up the stack. Datameer, Digital Reasoning and Karmasphere all participated in the event as examples of the types of partnerships IBM will seek.

IBM has already demonstrated, via Watson, that it knows how to deal with large-scale data and Hadoop, but to date, if you want those same capabilities from IBM, they will have to come mostly in the form of services. The event made it clear that IBM backs the Apache Hadoop effort but not in the form of new products. In effect, IBM used its bully pulpit (not to mention its size and presence in the market) to discourage others from fragmenting the market. The announcements may also have been intended to buy time for further product developments. I look for more definition from IBM on its product roadmap. If it wants to remain competitive in the big-data market, IBM needs to articulate how its products will interact with and support Hadoop. My soon-to-be-released Hadoop and Information Management benchmark research will provide some facts on whether IBM is making the right bet on Hadoop.

Regards,

David Menninger – VP & Research Director
