Earlier this week I attended Hadoop World in New York City. Hosted by Cloudera, the one-day event was by almost all accounts a smashing success. Attendance was approximately double that of last year. There were five tracks filled mostly with user presentations. According to Mike Olson, CEO of Cloudera, the conference’s tweet stream (#hw2010) was one of the top 10 trending topics of that morning. Cloudera did an admirable job of organizing the event for the Hadoop community rather than co-opting it for its own purposes. Certainly, this was not done out of altruism, but it was done well and in a way that respected the time and interests of those attending.
If you are not familiar with Hadoop, it is an open source software framework used for processing “big data” in parallel across a cluster of industry-standard servers. Hadoop is largely synonymous with MapReduce, but the Hadoop framework includes a variety of components, among them a distributed file system (HDFS), a scripting language (Pig), a limited set of SQL operations (Hive) and other data management tools.
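To give a flavor of the MapReduce model that Hadoop parallelizes across a cluster, here is a minimal sketch in plain Python. This is a hypothetical illustration of the pattern only, not Hadoop API code: the map step emits key-value pairs, the framework shuffles them by key, and the reduce step aggregates each group.

```python
# Illustrative sketch of the MapReduce pattern, simulated in one process.
# In Hadoop, the map and reduce phases run in parallel across cluster nodes.
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key -- here, a word count.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs big clusters", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Word count is the canonical MapReduce example; the same map-shuffle-reduce structure underlies far larger jobs, with Hadoop handling data distribution, scheduling and fault tolerance.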
By the way, the name Hadoop comes from a stuffed toy – a yellow elephant – belonging to Doug Cutting’s son, which made an appearance at the event. Doug created Hadoop and is now part of Cloudera’s management team.
How big is “big data”? In his opening remarks, Mike shared some statistics from a survey of attendees. The average Hadoop cluster among respondents had 66 nodes and 114 terabytes of data. However, there is quite a range. The largest in the survey responses was a cluster of 1,300 nodes with more than 2 petabytes of data. (Presenters from eBay blew this away, describing their production cluster of 8,500 nodes and 16 petabytes of storage.) More than 60 percent of respondents had 10 terabytes or less, and half were running 10 nodes or fewer.
The one criticism of the event I heard repeatedly was that the sessions were too short for the presenters to get into the meat of their applications. John Kreisa, VP of Marketing at Cloudera, told me he agreed and indicated that the sessions likely will be longer next year.
What is it that makes Hadoop an elephant in the room? Over the past 12 to 18 months Hadoop has gone mainstream. A year ago, you could still say it was a fringe technology, but this week’s event and the development of a strong ecosystem around Hadoop make it clear that it is a force to be reckoned with. Many of the analytic database vendors have announced some type of support for Hadoop. Aster Data, Greenplum, Netezza and Vertica were sponsors of the event. Data integration and business intelligence vendors also have announced support for Hadoop, including event sponsors Pentaho and Talend. An ecosystem of development, administration and management tools is emerging as well, as shown by announcements from Cloudera and Karmasphere.
My colleague wrote about Cloudera Version 3 when it was announced back in June. You can expect to see new Cloudera Distributions for Hadoop (CDH) annually. Cloudera Enterprise – the bundling of CDH plus Cloudera’s management tools – will be released semi-annually. Version 3.0 is in beta now. Version 3.5 is planned for the first quarter of 2011 and includes real-time activity monitoring and an expanded file browser, among other things.
If you work with big data but don’t know about Hadoop, you should spend some time learning about it. Our research is already finding the need for simpler and more cost-effective methods to manage and use big data for analytics, business intelligence and information applications. If you want to understand some of the ways in which Hadoop is being used, I have another blog coming that will discuss its value for your business.
Let me know your thoughts or come and collaborate with me on Facebook, LinkedIn and Twitter.
Regards,
David Menninger – VP & Research Director