You are currently browsing the monthly archive for February 2016.

Last week I attended Spark Summit East 2016 at the New York Hilton Midtown. It revealed several ways in which Spark technology might impact the big data market.

Apache Spark is an open source data processing engine designed for large-scale computing. Spark is often used in conjunction with the open source Apache Hadoop, but it can be used with other data sources as well such as Cassandra, MongoDB and Amazon S3. The creators of Spark founded Databricks, which drives the roadmap for Spark and leads community evangelism including organizing the Spark Summit events. According to Arsalan Tavakoli-Shiraji, VP of customer engagement and business development for Databricks, the company contributes approximately 75 percent of the code to the Spark project.

If you are wondering why Spark matters, consider that in June 2015, IBM announced that it would put more than 3,500 researchers and developers on Spark-related projects and would donate its IBM SystemML machine-learning technology to the Spark open source ecosystem. This scale of investment is likely to attract others in the industry, and interest in Spark is growing elsewhere, too. In his keynote presentation Matei Zaharia, co-founder and CTO of Databricks, claimed that summit attendees (across multiple locations) grew from 1,100 in 2014 to 3,900 in 2015. I was told by event staff that the New York event alone, including preconference training sessions, exhibitors and sponsors, had approximately 1,500 attendees, doubling attendance from the previous year.

This is where it starts to feel like, in the words of Yogi Berra, déjà vu all over again. In 2010 and 2011 I attended and wrote about events held in the very same venue. At the time, interest was growing in something called “Hadoop,” which, according to our recent benchmark research on big data analytics, has now grown to the point where it is used by 37 percent of organizations to process their big data analytics.

Recently there has been a lot of buzz around Spark, including speculation that Spark could replace Hadoop. Just search for “Spark replace Hadoop” and see how many pages come up. However, this is off target: Spark won’t replace Hadoop, in particular because it is not designed to store data. Rather it is designed to be used in conjunction with data storage tools such as the Hadoop Distributed File System (HDFS). Spark does, however, address a vr_Big_Data_Analytics_19_important_areas_of_big_data_analyticsshortcoming in one part of Hadoop: MapReduce. MapReduce was designed to process very large amounts of data in parallel to speed up processing, but it was designed for batch processing. As big data has grown in popularity, more and more users are accessing it and demanding interactive access to data regardless of its volume. More than half (54%) of participants in our research rated right-time and real-time analytics as important, and nearly half (48%) rated visual and data discovery important. Spark’s fundamental performance advantage over MapReduce lies in being designed to process data in memory to the extent possible.

Spark can be used to provide real-time analytic capabilities on big data. Spark has several components that provide those capabilities. Spark Streaming provides real-time capabilities for processing streaming data, that is, data being generated constantly. Spark SQL provides interactive query capabilities on structured data. And Spark ML provides machine-learning capabilities that can be used to build analytical models including predictive analytics. Our research shows that predictive analytics and statistics are the most important area of big data analytics, cited by more than three-fourths (78%) of participants.

At the conference Databricks announced Spark 2.0, planned for release in April or May. It will have three primary sets of enhancements. What is presently called Tungsten phase 2 improves performance of Spark with native memory management and better run-time code generation. Structured streaming combines real-time processing with batch and interactive queries. And the merging of APIs for DataFrame and DataSets will enable a single programming paradigm to be used for streaming and structured data.

There is a rich ecosystem developing around Spark. A variety of vendors attended the show including several of the Hadoop distribution vendors (Cloudera, Hortonworks, IBM and MapR) demonstrating and talking about how they support Spark. Analytic vendors including Alpine Data, Platfora, SAP, Stratio and Zoomdata demonstrated products that use Spark to provide interactive analytics over large amounts of data. Full-day training sessions that were part of the agenda were oversubscribed, with packed rooms and registration spots sold out on the event’s website.

The event was clearly focused on a technical audience, and a look through conference keynotes reveals many code snippets. However, this technical focus does not suggest a lack of business purpose in Spark applications. Use cases were also represented in the conference agenda, and future events may attract more business users. The popularity of the event and the acceptance of Spark by the Hadoop community and vendors suggest that Spark is here to stay – at least for the foreseeable future. If you are working with or plan to work with big data, it would be worthwhile to understand how Spark can be incorporated into your architecture, either directly as part your own programming efforts or via third-party tools to use in conjunction with big data.


David Menninger

SVP & Research Director

The big data market continues to expand and enable new types of analyses, new business models and new revenues streams for organizations that implement these capabilities. Following our previous research into big data and information optimization, we’ll investigate the technology trends affecting both of these domains as part of our 2016 research agenda.

A key tool for deriving value from big data is in-memory computing. As data is generated, organizations can use the speed of in-memory computing to accelerate the analytics on that data. Nearly-two thirds (65%) of participants in our big data analytics benchmark research identified real-time analytics as an important aspect of in-memory computing. Real-time analytics enables organizations to respond to events quickly, for instance, minimizing or avoiding the cost of downtime in manufacturing processes or rerouting deliveries that are in transit to cover delays in other shipments to preferred customers. Several big data vendors offer in-memory computing in their platforms.

Predictive analytics and machine learning also contribute to information optimization. These analytic techniques can automate some decision-making to improve and accelerate business processes that deal with large amounts of data. Our new big data benchmark research will investigate the use of predictive analytics with big data, among other topics. In combination with our upcoming data preparation benchmark research, we’ll explore the unification of big data technologies and the impact on resources and tools needed to successfully use big data. In our previous research, three-quarters of participants said they are using business intelligence tools to work with big data analytics. We will look for similar unification of other technologies with big data.

vr_Big_Data_Analytics_03_technology_for_big_data_analyticsThe emergence of the Internet of Things (IoT) – an extension of digital connectivity to devices and sensors in homes, businesses, vehicles and potentially almost anywhere – creates additional volumes of data and brings pressure for data in motion for both analytics and operations. That is, the data from these devices is generated in such volumes and with such frequency that specialized technologies have emerged to tackle these challenges. We’ll explore in depth the myriad issues arising from this explosion of connectivity in our benchmark research on the Internet of Things and Operational Intelligence this year.

Another key trend we will explore is the use of data preparation and information management tools to simplify accessibility to data. Data preparation is a key step in this process, yet our data and analytics in the cloud benchmark research reveals that data preparation requires too much time: More than half (55%) of participants said they spend the most time in their analytic process preparing data for analysis. Virtualizing data access can accelerate access to data and enables data exploration with less investment than is required to consolidate data into a single data repository. We will be tracking adoption of cloud-based and virtualized integration capabilities and increasing use of Hadoop as a data source and store for processing of big data. In addition, our research will examine the role of search, natural language and text processing.

We suggest organizations develop their big data competencies for continuous analytics – collecting and analyzing data as it is generated. It should start with establishing appropriate data preparation processes for information responsiveness. Data models and analyses should support machine learning and cognitive computing to automate portions of the analytic process. Much of this data will have to be processed in real time as it is being generated. All of these advances will need advanced methods for big data governance and master data management. We look forward to reporting on developments in these areas throughout 2016 in our Big Data and Information Optimization Research Agenda.


David Menninger

SVP & Research Director

Follow on

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 22 other followers

RSS David Menninger’s Analyst Perspective’s at Ventana Research

  • An error has occurred; the feed is probably down. Try again later.

David Menninger – Twitter

Top Rated

Blog Stats

  • 46,527 hits
%d bloggers like this: