
Data virtualization is not new, but it has changed over the years. The term describes a process of combining data on the fly from multiple sources rather than copying that data into a common repository such as a data warehouse or a data lake, which I have written about. There are many reasons for an organization concerned with managing its data to consider data virtualization, most stemming from the fact that the data does not have to be copied to a new location. It could, for instance, eliminate the cost of building and maintaining a copy of one of the organization's big data sources. Recognizing these benefits, many database and data integration companies offer data virtualization products. Denodo, one of the few independent, best-of-breed vendors in this market today, brings these capabilities to big data sources and data lakes.

Google Trends presents a graphic representation of the decline in popularity of the term data federation and the rise of the term data virtualization over time. The change in terminology corresponds with a change in technology. The industry has evolved from a data federation approach to today's cost-based optimization approach. In a federated approach, queries are sent to the appropriate data sources without much intelligence about the overall query or the cost of the individual parts of the federated query. Each underlying data source performs its portion of the workload as best it can and returns the results. The various parts are combined and additional post-processing is performed if necessary, for example to sort the combined result set.
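
To make the federated pattern concrete, here is a minimal Python sketch of that scatter-gather approach. The sources and data are hypothetical, and a real federation engine works against SQL sources rather than Python lists.

```python
# A minimal sketch of the federated pattern described above: each source
# executes its portion of the query independently, and the federating
# layer combines the results and post-processes them (here, a sort).
# Source names and data are hypothetical.

def query_source(source, predicate):
    """Each source filters its own rows as best it can."""
    return [row for row in source if predicate(row)]

# Two hypothetical sources holding (customer, amount) rows.
us_sales = [("Acme", 120), ("Birch", 45)]
eu_sales = [("Coro", 300), ("Dune", 45)]

# Scatter the same query to every source...
partials = [query_source(s, lambda r: r[1] >= 45) for s in (us_sales, eu_sales)]

# ...then gather and post-process: merge the partial results and sort.
combined = sorted((row for part in partials for row in part), key=lambda r: r[1])
print(combined)
```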

Denodo takes a different approach. Its tools consider the costs of each part of the individual query and evaluate trade-offs. As the saying goes, there's more than one way to skin a cat; in this case, there's more than one way to execute a SQL statement. For example, suppose you wish to create a list of all sales of a certain set of products. Your company has 1,000 products (maintained in one system) and hundreds of millions of customer transactions (maintained in another system). The federated approach would bring both data sets to the federated system, join them and then find the desired subset of products. An alternative would be to ship the table of 1,000 products to the system that holds the customer transactions, load it as a temporary table and perform the join there, returning only the matching rows. Today's data virtualization evaluates the estimated cost in time of each alternative and selects the one that would produce the result set faster.
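
As a rough illustration of that trade-off, the following sketch compares the two plans using a deliberately simple cost model in which cost is just the number of rows moved over the network. All the numbers are hypothetical, and Denodo's actual optimizer uses far richer statistics than this.

```python
# A hedged sketch of the cost-based choice described above.
# Cost model: rows moved over the network. Illustrative only.

PRODUCTS_ROWS = 1_000            # small product table in system A
TRANSACTIONS_ROWS = 300_000_000  # large transaction table in system B
MATCHING_ROWS = 2_000_000        # estimated rows in the final result

# Plan 1 (federated): pull both tables to the virtualization layer.
plan_pull_both = PRODUCTS_ROWS + TRANSACTIONS_ROWS

# Plan 2: ship the small product table to system B as a temporary
# table, join there, and move only the matching rows back.
plan_ship_small = PRODUCTS_ROWS + MATCHING_ROWS

best = min(("pull both tables", plan_pull_both),
           ("ship small table", plan_ship_small), key=lambda p: p[1])
print(f"chosen plan: {best[0]} (estimated rows moved: {best[1]:,})")
```

With these hypothetical cardinalities, shipping the small table wins by two orders of magnitude, which is exactly the kind of decision a federated system cannot make without cost information.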

Data virtualization can make it easier, and therefore faster, to set up access to data sources in an organization. Using Denodo, users connect to existing data sources, which become available as virtual resources. In the case of data warehouses or data lakes, this virtual representation is often referred to as a logical data warehouse or a logical data lake. No matter how hard you work to consolidate data into a central repository, there are often pieces of data that must be combined from multiple sources, and we find that such situations are common. In our big data integration benchmark research, one-fourth (26%) of organizations said that data virtualization is a key activity for their big data analytics, yet only 14 percent said that they have adequate data virtualization capabilities.

Not all the work is eliminated by data virtualization. You must still design the logical model for the data that you want to provide, such as which tables and which columns to include, but that’s all. Virtualization eliminates load processes and the need to update the data. In the case of big data, there are no extra clusters to set up and maintain. The logical data warehouse or data lake uses the security and governance system already in place. As a result, users can avoid some of the organizational battles about data access since the “owner” of the data continues to maintain the rights and restrictions on the data. Our research shows that organizations that have adequate data virtualization capabilities are more often satisfied with the way their organization manages big data than are organizations as a whole (88% vs. 58%) and are more confident in the data quality of their big data integration efforts (81% vs. 54%).
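
To illustrate what designing the logical model involves, here is a minimal, hypothetical sketch in Python. Denodo expresses these definitions in its own modeling tools rather than code like this, but the idea is the same: you declare which tables and columns a virtual view exposes, while the source system keeps the data and its access controls.

```python
# A hypothetical sketch of a logical model: a virtual view declares the
# tables and columns it exposes; no data is copied or loaded.

class VirtualView:
    def __init__(self, name, source, table, columns):
        self.name, self.source, self.table, self.columns = name, source, table, columns

    def fetch(self):
        # Data is read from the owning source at query time, so security
        # and freshness remain the source system's responsibility.
        return [{c: row[c] for c in self.columns} for row in self.source.read(self.table)]

class InMemorySource:
    """Stands in for a real system of record (CRM, warehouse, etc.)."""
    def __init__(self, tables):
        self.tables = tables

    def read(self, table):
        return self.tables[table]

crm = InMemorySource({"customers": [{"id": 1, "name": "Acme", "ssn": "xxx"}]})
# The logical model exposes only the columns we choose; "ssn" stays behind.
view = VirtualView("dim_customer", crm, "customers", ["id", "name"])
print(view.fetch())
```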

In its most recent release, version 6.0, Denodo enhanced its cost-based query optimizer for data virtualization. Many of the optimizer's features would be found in any decent relational database management system, but the challenge becomes greater when the underlying resources are scattered among multiple systems. To address this issue, Denodo collects and maintains statistics about the various data sources, which are evaluated at run time to determine the optimal way to execute queries. The product offers connectivity to a variety of data sources, both structured and unstructured, including Hadoop, NoSQL, documents and websites. It can be deployed on premises, in the cloud using Amazon Web Services or in a hybrid configuration.
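
One simple way to picture those run-time statistics is a catalog keyed by source and table that the optimizer consults when costing a plan. The sketch below is illustrative only; the statistics and cost formula are my own simplification, not a description of Denodo's internals.

```python
# A hedged sketch of per-source statistics used for run-time plan costing.
# The fields and formula are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SourceStats:
    row_count: int        # table cardinality at the source
    transfer_cost: float  # relative cost per row moved to the virtualization layer

catalog = {
    ("warehouse", "transactions"): SourceStats(300_000_000, 1.0),
    ("crm", "products"): SourceStats(1_000, 1.5),  # slower link, higher per-row cost
}

def movement_cost(source, table, rows=None):
    """Estimated cost of moving `rows` (default: the whole table)."""
    stats = catalog[(source, table)]
    return (rows if rows is not None else stats.row_count) * stats.transfer_cost

# At query time the optimizer consults the catalog rather than guessing.
print(movement_cost("crm", "products"))            # move the whole small table
print(movement_cost("warehouse", "transactions"))  # move the whole big table
```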

Performance can be a key factor in user acceptance of data virtualization; users will balk if access is too slow. Denodo has published some benchmarks showing that performance of its product can be nearly identical to accessing data loaded into an analytical database. I never place much emphasis on vendor benchmarks as they may or may not reflect an actual organization’s configuration and requirements. However, the fact that Denodo produces this type of benchmark indicates its focus on minimizing the performance overhead associated with data virtualization.

When I first looked at Denodo, prior to the 6.0 release, I expected to see more optimization techniques built into the product. There's always room for improvement, but with the current release the company has made great strides and addressed many of these issues. In order to maximize the software's value to customers, I'd like to see the company invest in developing more technology partnerships with providers of data sources and analytic tools. Users would also find it valuable if Denodo could help manage and present consolidated lineage information. Not only do users need access to data, they need to understand how data is transformed both inside and outside Denodo.

If your organization is considering data virtualization technology, I recommend you evaluate Denodo. The company won the 2015 Ventana Research Technology Innovation Award for Information Management, and its customer Autodesk won the 2015 Leadership Award in the Big Data Category. If your organization is deluged with big data but is not considering data virtualization, it probably should be. As our research shows, it can lead to greater satisfaction with your data management and more confidence in the quality of your data.

Regards,

David Menninger

SVP & Research Director

Follow me on Twitter @dmenningerVR and connect with me on LinkedIn.

It has been more than five years since James Dixon of Pentaho coined the term "data lake." His original post suggests, "If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state." The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I'd like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges data lakes present, and outline some developments in software tools that support them.

Data lakes offer a way to deal with big data. A data lake combines massive storage capabilities for any type of data in any format with the processing power to transform and analyze that data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that look across any data available to the user. To understand why data lakes have become popular, it's helpful to contrast this approach with the enterprise data warehouse (EDW). In some ways an EDW is similar to a data lake. Both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop. Because these technologies are not as scalable, it is not practical to store all the raw data that comes into the organization, hence the need for summarization. In contrast, a data lake contains the most granular data generated across the organization. The data may be structured information, such as sales transaction data, or unstructured information, such as email exchanged in customer service interactions.
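
The distinction is easier to see in miniature. In this hypothetical sketch, the lake lands every record as-is, while the warehouse load step keeps only a structured summary; the data, file name and summary schema are all made up for illustration.

```python
# A minimal sketch contrasting the data lake and EDW approaches above.
# All data, paths and schemas are hypothetical.

import json

# Data lake style: land raw records as-is, in whatever shape they arrive.
raw_events = [
    {"type": "sale", "sku": "A-100", "amount": 29.99},
    {"type": "support_email", "body": "My order never arrived..."},  # unstructured
]
with open("lake_raw_events.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# EDW style: only summarized, structured rows survive the load step.
sales = [e for e in raw_events if e["type"] == "sale"]
daily_summary = {"date": "2016-07-01", "total_sales": sum(e["amount"] for e in sales)}
print(daily_summary)  # the support email never makes it into the warehouse
```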

Hadoop is often used with data lakes because it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed. Therefore, information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy, leading many industry analysts and even vendors to raise concerns about data lakes turning into data swamps. In general, these concerns center on the lack of governance of the data in a data lake, and that is a legitimate concern: these collections of data should be governed like any other set of information assets within an organization. The challenge has been that most governance tools and technologies were developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.
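
Governance of a lake need not be elaborate to be better than nothing. As a hedged sketch of the idea, the hypothetical catalog below records an owner, source system and sensitivity flag for each dataset landed in the lake; the field names are illustrative assumptions, not any vendor's schema.

```python
# A hedged sketch of minimal data lake governance: every dataset gets a
# catalog entry before it is treated as usable. Fields are illustrative.

from datetime import date

catalog = {}

def register_dataset(path, owner, source_system, contains_pii=False):
    """Record who owns a dataset, where it came from, and its sensitivity."""
    catalog[path] = {
        "owner": owner,
        "source_system": source_system,
        "contains_pii": contains_pii,
        "registered": date.today().isoformat(),
    }

register_dataset("lake/raw/support_emails", owner="cust-service",
                 source_system="zendesk", contains_pii=True)

# A governed lake can answer basic questions a swamp cannot:
pii_datasets = [p for p, m in catalog.items() if m["contains_pii"]]
print(pii_datasets)
```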

Another, lesser controversy centers on terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of the challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology, as evidenced in this post. IBM acknowledges the value of data lakes as well as their challenges in this post, but Jim Kobielus, IBM's Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term "data lake" is not featured prominently on IBM's website.

Despite the controversy and challenges, data lakes continue to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For businesses, this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be in a conventional EDW architecture. Our data and analytics in the cloud benchmark research shows that one in five (21%) organizations are integrating their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information, and it may require more analysis or manipulation since the data is not yet cleansed, but time is money and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools, some organizations hesitated to adopt data lakes, while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica's big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an "enterprise data lake management platform," Zaloni calls itself "the data lake company," and Waterline Data draws its name "from the metaphor of a data lake where the data is hidden below the waterline."

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.

Regards,

David Menninger

SVP & Research Director
