Google BigLake extends BigQuery across all data

For Gerrit Kazmaier, the distinction between managed databases and data lakes never made much sense, and it makes even less sense today when data piles up like towering mountains being pushed up by tectonic forces.

“That distinction never made sense,” Kazmaier, general manager of databases, data analytics and Looker at Google Cloud, said at a virtual meeting with journalists and analysts this week. “It was a technical necessity because the amount of data kept growing and it became too complicated and expensive to manage in traditional data storage technologies.”

As the amount of data increased, companies turned to data warehouses. As volumes continued to increase — and with higher percentages of unstructured data — they began adopting data lakes to complement their data warehouses.

“This led to an urgent need to store large amounts of data outside of traditional warehouses, at scale and at relatively low cost,” Kazmaier said. “That was the start of the data lake movement. But for all the organizations that tried to innovate on that data and, at the end of the day, found only a data swamp, it came at a heavy price in terms of consistency, security, and manageability.”

It also created separate data silos within the IT environment that businesses had to manage, a problem that other vendors, from Hewlett Packard Enterprise and Dell Technologies to Pure Storage and Hitachi Vantara, are trying to solve. Earlier this year we wrote about a startup called Onehouse that emerged from stealth with a plan to leverage the open-source Apache Hudi project to bring database and data warehouse capabilities to data lakes and create lakehouses that can ingest and manage structured, semi-structured, and unstructured data.

Google Cloud wants to do something similar. At its Data Cloud Summit this week, the company is unveiling BigLake, which brings data warehouses and data lakes together so organizations can store, manage, and analyze a single copy of their data without duplicating or moving it, or worrying about the underlying storage format or system.

BigLake extends Google Cloud’s BigQuery data warehouse capabilities to data lakes on Google Cloud Storage, provides an API interface for fine-grained access control on Google Cloud, and supports open formats like Parquet and open-source processing engines like Apache Spark. It eliminates what Kazmaier called the “artificial separation between managed warehouses and data lakes.”
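In practice, a BigLake table is declared over open-format files in Cloud Storage with BigQuery's `CREATE EXTERNAL TABLE ... WITH CONNECTION` DDL. The sketch below assembles such a statement in Python; the dataset, connection, and bucket names are hypothetical placeholders, and the exact DDL options for a real deployment should be checked against the BigQuery documentation.

```python
def biglake_table_ddl(dataset, table, connection, gcs_uris):
    """Build the DDL for a BigLake table over Parquet files in Cloud Storage.

    All identifiers passed in here (dataset, table, connection, URIs) are
    illustrative placeholders, not real resources.
    """
    uris = ", ".join(f"'{u}'" for u in gcs_uris)
    return (
        f"CREATE EXTERNAL TABLE `{dataset}.{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = [{uris}]\n"
        ")"
    )

ddl = biglake_table_ddl(
    "analytics", "events", "us.my-lake-connection",
    ["gs://my-bucket/events/*.parquet"],
)
print(ddl)
```

Because the table definition lives in BigQuery while the Parquet files stay in the lake, engines such as Spark and BigQuery SQL can read the same single copy of the data.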

BigLake, available in preview, is one of several new offerings and enhancements Google Cloud is unveiling at the event, building on the company’s years of work on data tools such as BigQuery; Vertex AI, a collection of services that enables organizations to build and manage machine learning workloads; the distributed SQL database and storage service Spanner; and the business intelligence platform Looker.

All of this, along with new offerings like the database migration program and updates in its partnership programs, aims to enable companies to more easily derive greater business value from the mountains of data they create. Google Cloud is the third largest cloud provider in the world with around 10 percent of global sales, behind Amazon Web Services (with around 33 percent) and Microsoft Azure (around 22 percent).

Addressing data challenges—not just storage and management, but movement, processing, analytics, and backup—can help Google Cloud further accelerate its multi-year effort to increase its presence in the enterprise. The market research company Statista predicts that more than 180 zettabytes of data will be created in 2025.

“Data is high on the agenda of every C-suite on this planet,” Kazmaier said. “We believe that you cannot transform using outdated technologies, outdated architectures, and outdated ideas to unlock the unlimited potential that data truly holds. … Data is now in multiple formats, it is streamed and stored, it lives in data centers and in clouds. A data architecture must bring all of this together.”

Google Cloud was able to build on what it had already done in data warehousing with services like BigQuery to develop BigLake.

“We have tens of thousands of customers on BigQuery, and we’ve invested heavily in overall governance, security, and all core functionality,” he said. “We’re taking this innovation from BigQuery and extending it to all data residing in different formats and in lake environments, whether it’s on Google Cloud with Google Cloud Storage, on AWS, or in Azure. We take the innovations and extend them to other data lake environments.”

Together with BigLake, Google Cloud will soon allow data engineers to track changes in their Spanner database in real time. Spanner Change Streams, coming soon, tracks inserts, updates, and deletes across the database. The changes can be replicated to BigQuery to drive analytics and saved to Google Cloud Storage for compliance.
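Conceptually, a change stream emits an ordered sequence of insert, update, and delete records that a downstream pipeline applies to an analytics copy, for example in BigQuery. The following self-contained Python sketch illustrates that replication logic; the record format here is a deliberately simplified stand-in, not Spanner's actual change record schema.

```python
# Simplified downstream replica of a table, keyed by primary key.
replica = {}

def apply_change(record):
    """Apply one change-stream record (insert/update/delete) to the replica."""
    op, key = record["mod_type"], record["key"]
    if op in ("INSERT", "UPDATE"):
        replica.setdefault(key, {}).update(record["values"])
    elif op == "DELETE":
        replica.pop(key, None)

# A hypothetical sequence of changes as a change stream might emit them.
stream = [
    {"mod_type": "INSERT", "key": 1, "values": {"name": "Ada", "score": 10}},
    {"mod_type": "UPDATE", "key": 1, "values": {"score": 15}},
    {"mod_type": "INSERT", "key": 2, "values": {"name": "Grace", "score": 7}},
    {"mod_type": "DELETE", "key": 1},
]

for record in stream:
    apply_change(record)

print(replica)  # {2: {'name': 'Grace', 'score': 7}}
```

Applying changes in stream order keeps the replica consistent with the source: row 1 is inserted, updated, then deleted, so only row 2 survives for analytics.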

Available now, Vertex AI Workbench creates a single interface for data and machine learning systems, providing users with a common toolset for data analysis, data science, and machine learning, as well as direct access to BigQuery. Workbench also integrates with Serverless Spark and Dataproc, enabling organizations to build, train, and deploy machine learning models five times faster than with traditional systems, said June Yang, vice president of Cloud AI and Analytics Services at Google Cloud.

In addition, Google Cloud has Vertex AI Model Registry, a service in preview that makes it easier for data scientists to share models and for developers to turn data into predictions faster.

Connected Sheets and Data Studio for Looker are part of Google Cloud’s effort to tighten up its business intelligence portfolio.

“We bring these two worlds together,” said Sudhir Hasbe, director of product management at Google Cloud. “Now you can leverage the self-service power of tools like Data Studio or Tableau and use Looker’s core semantic-layer model, where you define your metrics in a single place and all self-service tools work seamlessly with it. This will allow organizations and power users to have self-service tools, but also to centralize metrics and have a common understanding of the business across the enterprise.”
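The semantic-layer idea Hasbe describes can be sketched in a few lines: each metric is defined exactly once in a shared registry, and every "tool" that wants the number computes it through that registry, so results agree across the organization. This is a minimal conceptual sketch, not Looker's actual modeling language (which uses LookML), and the metric names and data are invented.

```python
# Central metric definitions: each metric is defined once, as a function
# over raw rows, and every downstream tool computes it the same way.
METRICS = {
    "revenue": lambda rows: sum(r["price"] * r["qty"] for r in rows),
    "order_count": lambda rows: len(rows),
}

def compute(metric_name, rows):
    """Any 'tool' (dashboard, spreadsheet, BI app) calls this shared layer."""
    return METRICS[metric_name](rows)

orders = [
    {"price": 10.0, "qty": 2},
    {"price": 5.0, "qty": 1},
]

# Two different "tools" asking for the same metric get the same answer.
print(compute("revenue", orders))      # 25.0
print(compute("order_count", orders))  # 2
```

The point of the design is that no tool re-implements "revenue" with its own slightly different formula; changing the definition in one place changes it everywhere.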
