
Predictive analytics is a rewarding yet challenging subject. In our benchmark research on next-generation predictive analytics, at least half the participants reported that predictive analytics allows them to achieve competitive advantage (57%) and create new revenue opportunities (50%). Yet even more participants said that users of predictive analytics don’t have enough skills training to produce their own analyses (79%) and don’t understand the mathematics involved (66%). (In the term “predictive analytics” I include all types of data science, not just one particular type of analysis.)

Various software vendors are taking steps to simplify the use of this technology. RapidMiner is one of them. The company focuses on making its open source predictive analytics faster and easier to use. Its database-independent predictive analytics platform has more than 1,400 customers and averages 20,000 downloads per month. The product, also called RapidMiner, has been deployed more than 100,000 times and has a community of some 250,000 users. The latest version of the platform, Version 7.1, was released in the spring. RapidMiner has been around for almost 10 years, and in that time the predictive analytics market has grown and changed dramatically in parallel with the big data market. Big data was not part of the original focus of the company, nor was cloud computing, but over time RapidMiner has incorporated capabilities in both areas.

The company also has a distinctive personality embodied by its founder and president, Ingo Mierswa. It is evident in his YouTube video series, “5 Minutes with Ingo,” in which he explains various aspects of predictive analytics. This approach to training potential users makes sense. According to our research, adequate training in predictive analytics concepts and the application of predictive analytics to business problems correlate more highly with satisfaction in using it (93% each) than does product training (85%). These satisfaction rates compare favorably with the overall average of just 66 percent. The RapidMiner training videos are not only entertaining but also can help an organization be more successful in understanding and using predictive analytics.

The RapidMiner product set itself provides several approaches to predictive analytics. RapidMiner Studio is a desktop tool for creating predictive analytic models. It is available for download from the RapidMiner website. Like many other predictive analytics tools, it includes connectors to a variety of data sources and supports data preparation tasks that are often needed before predictive models can be developed. Using drag-and-drop visual design, users create data flows or pipelines of activity moving data from sources, through any necessary transformations and into modeling processes.
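The pipeline idea can be sketched in a few lines of code. This is a purely hypothetical illustration of data flowing from a source through transformations into an analysis step; RapidMiner users build such flows visually with drag-and-drop, and none of these function names come from the product itself.

```python
# Sketch of a data pipeline: source -> transformations -> analysis.
# Illustrative only; RapidMiner expresses this as a visual flow of operators.
def pipeline(source, *steps):
    """Pass data through each step in order, like operators in a flow."""
    data = source
    for step in steps:
        data = step(data)
    return data

# Hypothetical raw rows: parse them, drop incomplete records, average a field.
raw = ["12,ok", "7,ok", ",missing", "5,ok"]
result = pipeline(
    raw,
    lambda rows: [r.split(",") for r in rows],               # parse
    lambda rows: [r for r in rows if r[0]],                  # data preparation
    lambda rows: sum(int(r[0]) for r in rows) / len(rows),   # "modeling" step
)
print(result)  # -> 8.0
```

Each step consumes the previous step's output, which is the essence of the flow-of-operators design the visual tool exposes.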

RapidMiner Studio has several unique features to guide the user through these processes. In designing the overall pipeline of activity, a feature called Wisdom of Crowds examines what other users have done in similar situations and recommends what the next step (or “operator”) ought to be. Behind the scenes, RapidMiner is using its own technology to help predict the most likely next step. Wisdom of Crowds also provides parameter recommendations to help choose among the myriad of options and parameter settings. As further techniques to assist users, RapidMiner Studio has components to compare multiple models and to select models automatically.
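A toy sketch of the recommendation idea (my own illustration, not RapidMiner's actual algorithm): tally which operator other users most often placed after the current one, and suggest the most frequent follower.

```python
# Sketch of a "most likely next step" recommender in the spirit of
# Wisdom of Crowds. The pipelines and operator names here are hypothetical.
from collections import Counter

# Hypothetical pipelines built by other users.
histories = [
    ["read_csv", "replace_missing", "normalize", "knn"],
    ["read_csv", "replace_missing", "decision_tree"],
    ["read_db", "replace_missing", "normalize", "svm"],
]

def recommend_next(current_op, histories):
    """Recommend the operator that most often followed current_op."""
    followers = Counter(
        seq[i + 1]
        for seq in histories
        for i, op in enumerate(seq[:-1])
        if op == current_op
    )
    return followers.most_common(1)[0][0] if followers else None

print(recommend_next("replace_missing", histories))  # -> normalize
```

The real feature is far richer, since it also recommends parameter settings, but the principle of mining other users' choices is the same.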

While users can perform the entire predictive analytics process using RapidMiner Studio alone, they also can connect it to RapidMiner Server to support larger data sets and collaboration among multiple users. The Server product has a shared repository for processes, data and connections to other data sources and includes a framework to provide security and version control for the various items in the repository. As an alternative to an on-premises server, RapidMiner Cloud provides the same capabilities as the server product in a hosted environment.

For big data analytics, RapidMiner Radoop leverages Hadoop implementations by pushing down the predictive analytics pipelines created in RapidMiner Studio. These pipelines execute in the appropriate Hadoop component, including MapReduce, Spark, Pig, Hive and Mahout, allowing access to the full data set and taking advantage of the cluster resources for parallel execution of the workloads without the need to code in any of these tools. Spark has become a popular framework for analytics on Hadoop, as evidenced by the Spark Summits, which I wrote about recently. It provides faster execution of analytic processes and a more flexible, expressive framework than MapReduce. Users familiar with Spark (SparkR or MLlib), PySpark, Pig or Hive can write scripts in these packages that can be executed with Radoop. For security and authentication, Radoop integrates with Kerberos, Apache Sentry and Apache Ranger.

RapidMiner recognizes the value of visualization in the analytics process and has established technical partnerships and integration with two providers, Qlik and Tableau. RapidMiner Studio can create both Qlik and Tableau data exchange files for visualization of the output of predictive analytics models. Other connections, integrations and extensions are available through the RapidMiner marketplace, including Cassandra, MongoDB, Solr and Splunk.

To gain maximum value from predictive analytics, organizations must not only create the models to predict behaviors, they must deploy those models in an operational context to impact business outcomes in real time. According to our research, more than one-third (37%) of organizations are applying their models at least on a daily basis. RapidMiner can convert any of its pipeline processes into a Web service so they can be embedded in other business processes and invoked in real time. RapidMiner also supports PMML, an industry standard for expressing models that allows models to be embedded into databases for real-time scoring of new data records as they are entered into the database.

While RapidMiner has invested in making predictive analytics easier to use and accessible to a wider group of analysts, it is a daunting challenge to make these types of analyses truly self-service. Knowing when to use a particular algorithm and how to set all the various parameters requires deep knowledge of the discipline of predictive analytics. For example, in creating a k-nearest neighbors model, how many people would know what value of “k” to use for the number of nearest neighbors to model? And this is just one relatively simple parameter on one type of algorithm. The Wisdom of Crowds parameter recommendations help, but it’s still not an automated process, and users should realize they will need at least some knowledge of the various algorithms to maximize the effectiveness of their modeling efforts.
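To make the k-selection problem concrete, here is a small pure-Python sketch (illustrative data and code, unrelated to RapidMiner's implementation) that evaluates candidate values of k with leave-one-out validation:

```python
# Sketch: why choosing "k" in k-nearest neighbors matters.
# A tiny k-NN classifier plus leave-one-out evaluation to pick k.
# The data set is hypothetical and deliberately simple.
from collections import Counter
import math

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: hold out each point and try to predict it."""
    hits = sum(
        knn_predict(data[:i] + data[i + 1:], point, k) == label
        for i, (point, label) in enumerate(data)
    )
    return hits / len(data)

# Two small, well-separated clusters of labeled points.
data = [((x, y), "a") for x in (0, 1) for y in (0, 1)] + \
       [((x, y), "b") for x in (5, 6) for y in (5, 6)]

# Try several values of k and keep the one with the best held-out accuracy.
best_k = max(range(1, 6), key=lambda k: loo_accuracy(data, k))
print(best_k, loo_accuracy(data, best_k))
```

On messier, real-world data the choice of k genuinely changes results, which is why parameter recommendations such as Wisdom of Crowds are valuable even though they do not remove the need to understand the algorithm.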

I’d also like to see RapidMiner invest more in the model management process. Once a model is created, it immediately starts to become stale for various reasons. Market conditions change. New data is generated. The competitive environment changes. The key questions are how far out of date the model has become and when it should be replaced with a better model. Models should constantly be re-evaluated. In our predictive analytics research 63 percent of organizations that update their models at least daily reported a significant improvement in their activities and processes, compared with 31 percent of those that update their models less frequently. Any vendor that automates this process could help organizations boost their effectiveness.
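A minimal sketch of the re-evaluation idea, using hypothetical accuracy figures: track the model's accuracy on newly labeled data and flag it for retraining once accuracy slips a set amount below the level measured at deployment.

```python
# Sketch: a simple model-freshness check (an illustration of the idea,
# not a RapidMiner feature). All numbers below are hypothetical.
def needs_retraining(baseline_accuracy, recent_accuracy, tolerance=0.05):
    """Flag the model as stale if recent accuracy has slipped too far."""
    return recent_accuracy < baseline_accuracy - tolerance

# The model scored 0.90 at deployment but only 0.82 on the most
# recent batch of labeled data, so it is flagged for retraining.
print(needs_retraining(0.90, 0.82))  # -> True
print(needs_retraining(0.90, 0.88))  # -> False
```

A production version would also account for data drift and seasonality, but even a threshold this simple turns "when should we rebuild the model?" from a guess into a monitored decision.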

Overall RapidMiner has made predictive analytics more accessible to a wider audience via its products and its educational efforts. The company has done this in an entertaining way, which is important to retain the attention of those who are being educated. Predictive analytics is a critical aspect of maximizing the value of data in an organization. Those that are not taking advantage of these types of analytics should be. RapidMiner makes it easier to tackle some of these challenges and may help get any organization over the hump of learning how to build and deploy predictive analytic models.


David Menninger

SVP & Research Director

Follow Me on Twitter @dmenningerVR and Connect with me on LinkedIn.


It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges it presents, and outline some developments in software tools that support data lakes.

Data lakes offer a way to deal with big data. A data lake combines massive storage capabilities for any type of data in any format as well as processing power to transform and analyze the data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that look across any data available to the user. To understand why data lakes have become popular it’s helpful to contrast this approach with the enterprise data warehouse (EDW). In some ways an EDW is similar to a data lake. Both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop. Because these technologies are not as scalable, it is not practical to store all the raw data that comes into the organization. Hence there is a need for summarization. In contrast, a data lake contains the most granular data generated across the organization. The data may be structured information, such as sales transaction data, or unstructured information, such as email exchanged in customer service interactions.

Hadoop is often used with data lakes because it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed. Therefore, information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy, with many industry analysts and even vendors raising concerns about data lakes turning into data swamps. In general, these concerns center on the lack of governance of the data in a data lake, and the concern is an appropriate one. These collections of data should be governed like any other set of information assets within an organization. The challenge was that most of the governance tools and technologies had been developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.

Another, perhaps more minor controversy centers on terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of the challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology as evidenced in this post. IBM acknowledges the value of data lakes as well as their challenges in this post, but Jim Kobielus, IBM’s Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term “data lake” is not featured prominently on IBM’s website.

Despite the controversy and challenges, data lakes are continuing to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For business this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be available in a conventional EDW architecture. Our data and analytics in the cloud benchmark research shows that one in five (21%) organizations are integrating their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information, and it may require more analysis or manipulation since the data is not yet cleansed, but time is money and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools, some organizations hesitated to adopt data lakes, while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica’s big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an “enterprise data lake management platform,” Zaloni calls itself “the data lake company,” and Waterline Data draws its name “from the metaphor of a data lake where the data is hidden below the waterline.”

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.


David Menninger

SVP & Research Director
