I recently spent time at Strata+Hadoop World 2016 in New York. I have attended this event and its predecessor, Hadoop World, off and on for the past six years. This one in New York had a different feel from previous events, including the most recent event in San Jose at the end of March. Perhaps because of its location in one of the financial and commercial hubs of the world, the event had much more of a business orientation. But it’s not just location. Past events have also been held in New York, and I see the business focus as a sign of the Hadoop market maturing.

Our research shows that big data can have significant business benefits. In our Big Data and Analytics benchmark research, more than three-quarters (78%) of participants indicated that predictive analytics is the most important area of big data analytics for their organization. In our Predictive Analytics research almost three out of five (57%) organizations said they have achieved a competitive advantage through their application of advanced analytics. Thus we are moving beyond the early adopter phase of the technology adoption life cycle into the early majority. More and more organizations recognize that big data and advanced analytics can provide a competitive advantage. As a result, we see more focus on the business value of it, not just the technology required to pursue this advantage.

At the Strata+Hadoop World keynote presentations many vendors chose to bring their customers on stage or share stories about how their customers are positively impacting their organizations with big data technology. There were also plenty of technical training sessions, including two full days of training prior to the keynotes and expo, but the main stage of the event was focused on what you can do with big data rather than how to do it. The attendees also seemed to bring a business focus to the event. I spoke with multiple vendors in the expo hall who had attended both the Strata+Hadoop event in San Jose earlier this year and the New York event. They all described customer interactions that had more of a business focus than at previous events. People came looking for ways to apply big data technology to real business needs.

This is not to say there wasn’t plenty of technology at the event, in particular data science, streaming data, and data preparation and governance. Tutorials were offered on a variety of data science topics, including how to implement machine learning with Python and Spark. Our research shows that Python is one of the most popular languages for data science analyses, in use by more than one-third (36%) of organizations. As I have written previously, Spark is growing in popularity as a way of providing big data, machine learning and real-time capabilities. At least half a dozen vendors ranging from large to small participated in the expo, touting their data science capabilities, and many other vendors’ marketing materials described how they support data science, for instance with data preparation tools that enable the data science process.

Processing streaming data in real time was also a frequent theme. Part of what makes big data big is that it is being generated constantly. It follows that you can probably get value out of analyzing that data in real time as it is being generated. In our research real-time analytics is the second-most frequently cited (by 54%) area of big data analytics, after predictive analytics. In its original incarnation, Hadoop was designed as a batch-oriented system, but as it has grown in popularity, much attention has been given to adding real-time capabilities to the Hadoop ecosystem, which I have described.
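To make the batch versus real-time distinction concrete, here is a minimal sketch in plain Python. It is not Hadoop or Spark code; the `RunningMean` class and the sample readings are hypothetical, chosen only for illustration. A batch job waits for all the data and then computes a result, while a streaming computation updates its answer as each value arrives.

```python
class RunningMean:
    """Hypothetical helper: keeps an average up to date one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: no need to keep or rescan earlier values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Made-up sensor readings standing in for a stream of events.
readings = [10.0, 20.0, 30.0]

# Streaming style: an answer is available after every event.
stream = RunningMean()
streamed_means = [stream.update(r) for r in readings]

# Batch style: one answer, available only once all the data is collected.
batch_mean = sum(readings) / len(readings)
```

Both approaches end at the same average. The difference is that the streaming version had a usable answer after the first reading, which is where the value of real-time analysis comes from.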

The themes of data preparation and governance come as no surprise. Our Big Data Integration benchmark research shows that reviewing data for quality and consistency issues (52%) and preparing data (46%) are cited as the two most time-consuming aspects of the big data integration process. Similarly our big data analytics research shows that data quality and information management is the second-most common barrier to big data analytics, cited by 39 percent of organizations. Vendors and the big data community are on the right track in addressing these issues.

The big data community continues to evolve, and the Strata+Hadoop World events are helping to foster dialog, education and growth. I’d say that this most recent event is evidence that the big data community is “growing up,” meaning that the focus has shifted to delivering business value. Strata+Hadoop World is a place where you can learn not only about the technology of big data but also how to solve business problems.

Regards,

David Menninger

SVP & Research Director

Follow Me on Twitter @dmenningerVR and Connect with me on LinkedIn.

It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges they present, and outline some developments in software tools that support data lakes.

Data lakes offer a way to deal with big data. A data lake combines massive storage capabilities for any type of data in any format with the processing power to transform and analyze the data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that look across any data available to the user. To understand why data lakes have become popular it’s helpful to contrast this approach with the enterprise data warehouse (EDW). In some ways an EDW is similar to a data lake. Both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop. Because these technologies are not as scalable, it is not practical to store all the raw data that comes into the organization. Hence there is a need for summarization. In contrast, a data lake contains the most granular data generated across the organization. The data may be structured information, such as sales transaction data, or unstructured information, such as email exchanged in customer service interactions.
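The contrast between summarized and granular data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions, field names and the `summarize_by` helper are all hypothetical, and a real data lake or EDW would of course use very different tooling.

```python
from collections import defaultdict

# Hypothetical raw sales transactions, as they might land in a data lake.
RAW_TRANSACTIONS = [
    {"customer": "acme", "region": "east", "amount": 120.0},
    {"customer": "acme", "region": "east", "amount": 80.0},
    {"customer": "globex", "region": "east", "amount": 50.0},
    {"customer": "initech", "region": "west", "amount": 200.0},
]


def summarize_by(records, key):
    """Total the `amount` field for each distinct value of `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


# EDW-style load: only a regional summary is kept, and the detail is dropped.
warehouse_summary = summarize_by(RAW_TRANSACTIONS, "region")

# Data-lake style: the raw detail is kept, so a question nobody planned for
# (spend per customer) can still be answered later.
lake_by_customer = summarize_by(RAW_TRANSACTIONS, "customer")
```

Once only `warehouse_summary` is stored, the per-customer question can no longer be answered, because the customer field was lost in summarization. Keeping the granular records is what leaves that option open.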

Hadoop is often used with data lakes because it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed. Therefore, information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy, with many industry analysts and even vendors raising concerns about data lakes turning into data swamps. In general, the concerns about data lakes becoming data swamps center on the lack of governance of the data in a data lake, an appropriate topic here. These collections of data should be governed like any other set of information assets within an organization. The challenge was that most of the governance tools and technologies had been developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.

Another, perhaps more minor, controversy centers on terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of the challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology as evidenced in this post. IBM acknowledges the value of data lakes as well as their challenges in this post, but Jim Kobielus, IBM’s Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term “data lake” is not featured prominently on IBM’s website.

Despite the controversy and challenges, data lakes are continuing to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For business this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be available in a conventional EDW architecture. Our data and analytics in the cloud benchmark research shows that one in five (21%) organizations are integrating their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information, and it may require more analysis or manipulation since the data is not yet cleansed, but time is money and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools, some organizations hesitated to adopt data lakes, while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica’s big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an “enterprise data lake management platform,” Zaloni calls itself “the data lake company,” and Waterline Data draws its name “from the metaphor of a data lake where the data is hidden below the waterline.”

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.

Regards,

David Menninger

SVP & Research Director
