A key benefit of the cloud is that it can help organizations simplify their IT operations, particularly with Software-as-a-Service (SaaS) offerings that remove the legwork and headaches of deployment and management. But could the proliferation of SaaS services in the data, analytics, and AI space be overdoing it?
Earlier this year, we officially called on cloud providers to untangle the mess and make their clouds more user-friendly by taking the burden of integrating all their services off the customer’s shoulders. Google Cloud made some announcements at its data summit this week about a new connective tissue that should address some of this.
The headline: BigLake
The headline was Google’s announcement of the preview of BigLake, which Google calls a “storage engine.” What it really is, though, is an API: it generalizes the API Google originally developed for BigQuery to project a tabular structure over data stored in Google Cloud object storage. This, in turn, allows analysis and query tools to access the data, providing a granular view of the data against which governance and access controls can be enforced.
In one sense, BigLake is Google’s answer to data lakehouses, a rapidly evolving approach to blurring the distinctions between data lakes and data warehouses.
BigLake supports all major data lake formats, including JSON, CSV, and Parquet; these formats have emerged as the de facto standard alternatives to SQL tables for variably structured data. The API makes data in these formats look like the relational and clustered tables that BigQuery reads. And consistent with Google’s previous expansion of BigQuery into foreign territory with BigQuery Omni, this means that data stored in any Amazon S3-compatible object store can be registered as a readable data source.
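To make the idea of projecting tables over files concrete, here is a minimal Python sketch. It is not BigLake’s actual API: the `TableView` class, its `allowed_columns` policy, and the sample records are all hypothetical, and in-memory strings stand in for objects in cloud storage. It only illustrates the general pattern of reading differently formatted files as rows and enforcing column-level access on that tabular view.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for files sitting in an object store.
CSV_OBJECT = "id,name,salary\n1,Ana,100\n2,Raj,120\n"
JSON_OBJECT = (
    '{"id": 3, "name": "Mei", "salary": 90}\n'
    '{"id": 4, "name": "Tom", "salary": 110}\n'
)


def read_rows(payload, fmt):
    """Parse a raw object into a list of dict rows, whatever its format."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "jsonl":
        return [json.loads(line) for line in payload.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


class TableView:
    """Hypothetical tabular view over raw objects with column-level access rules."""

    def __init__(self, objects, allowed_columns):
        # objects: list of (payload, format) pairs
        # allowed_columns: maps a role name to the set of columns it may read
        self.objects = objects
        self.allowed_columns = allowed_columns

    def select(self, role, columns):
        """Return the requested columns as rows, if the role may read them all."""
        permitted = self.allowed_columns.get(role, set())
        denied = [c for c in columns if c not in permitted]
        if denied:
            raise PermissionError(f"role {role!r} may not read {denied}")
        rows = []
        for payload, fmt in self.objects:
            for row in read_rows(payload, fmt):
                # Normalize values to strings so CSV and JSON rows look alike.
                rows.append({c: str(row[c]) for c in columns})
        return rows


view = TableView(
    objects=[(CSV_OBJECT, "csv"), (JSON_OBJECT, "jsonl")],
    allowed_columns={"analyst": {"id", "name"}, "hr": {"id", "name", "salary"}},
)
names = view.select("analyst", ["id", "name"])
```

In this sketch, an analyst can read `id` and `name` across both files as one table, while a request for `salary` under the same role raises `PermissionError`.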
Google isn’t the first to get there. Amazon Redshift, Microsoft Azure Synapse Analytics, and others work with polyglot data stored in object storage, while Databricks, the originator of the lakehouse concept, has steadily upped its game to deliver more data warehouse-like performance for its Spark data lakes. But the key to BigLake isn’t just providing a data warehouse-like experience on cloud storage, as BigQuery already did. It is extending the construct so customers have consistent access from other Google services and open source engines like Spark, Presto, and Trino and, hopefully in the long run, third-party BI, AI, and analytics services.
In fact, BigLake makes all data appear as it does in BigQuery, but doesn’t require you to be a BigQuery user. It will support open source data lake constructs like Delta Lake and Iceberg, with Hudi on the roadmap. Google services like Vertex AI, Dataproc, and serverless Spark can all access the data.
BigLake builds on last fall’s announcement of Dataplex, Google’s entry for data fabric. While BigLake provides the API that presents data as tables, Dataplex is used for the logical classification of data into a hierarchy: the data lake, which defines the overall data pool; the data zone, which logically groups related data sets (and could be construed as a kind of workspace); and, at the most granular level, data assets (the specific data entities). Organizing data logically makes it easier to identify and manage. And as part of defining the data, Dataplex collects metadata that helps classify the data and make it discoverable through a data catalog.
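The lake, zone, and asset hierarchy described above can be pictured with a small Python model. This is a conceptual sketch only, not the Dataplex API: the `Lake`, `Zone`, and `Asset` classes, the tag names, and the path format are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Most granular level: a specific data entity, with descriptive tags."""
    name: str
    tags: set = field(default_factory=set)


@dataclass
class Zone:
    """Logical grouping of related assets, akin to a workspace."""
    name: str
    assets: list = field(default_factory=list)


@dataclass
class Lake:
    """Top level: the overall pool of data."""
    name: str
    zones: list = field(default_factory=list)

    def catalog(self):
        """Flatten the hierarchy into catalog entries: path -> sorted tags."""
        return {
            f"{self.name}/{zone.name}/{asset.name}": sorted(asset.tags)
            for zone in self.zones
            for asset in zone.assets
        }

    def find(self, tag):
        """Discover every asset path carrying the given metadata tag."""
        return [path for path, tags in self.catalog().items() if tag in tags]


# Hypothetical example data.
lake = Lake("sales", [
    Zone("raw", [
        Asset("orders_csv", {"orders"}),
        Asset("customers_json", {"pii", "customers"}),
    ]),
    Zone("curated", [
        Asset("customer_360", {"pii", "customers", "orders"}),
    ]),
])
pii_paths = lake.find("pii")
```

Here `lake.find("pii")` walks the hierarchy and returns the paths of the two assets tagged `pii`, which is the kind of metadata-driven discovery a catalog is meant to support.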
Connect operational and analytical data
Another piece of the connectivity puzzle is bringing operational data together with analytics. Among this week’s announcements, Google previewed Spanner change streams. It’s the second piece of the puzzle for integrating Spanner with BigQuery; last year, Google introduced the ability for BigQuery to run federated queries against Spanner. While last year’s announcement focused on data at rest, this one adds the ability to update BigQuery in real time, making it a far more robust alternative to customers having to write their own code.
Of course, change data capture streams are nothing new. But in most cases, customers need to write their own integrations, such as when exporting a DynamoDB stream to a relational table endpoint. Otherwise, customers must opt for mixed-workload databases that conditionally replicate incoming rows into parallel in-memory columnar stores. For cloud providers like Google that offer specialized databases, integrated change stream capabilities are essential for connecting operational and analytical systems. For Google’s next move, we’d like to see similar integration with streaming, removing the need for Cloud Dataflow customers to write their own bridges.
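For readers unfamiliar with change data capture, the kind of bridge customers otherwise write themselves looks roughly like the Python sketch below. It is an illustration under stated assumptions: the record layout (`op`, `key`, `values`) is hypothetical and is not the format of Spanner change streams or DynamoDB streams, and a plain dictionary stands in for the analytical table.

```python
def apply_changes(table, changes):
    """Apply ordered change records to an analytical copy keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the changed columns into the existing row, if there is one.
            table[key] = {**table.get(key, {}), **change["values"]}
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


# Hypothetical stream of changes emitted by an operational database.
stream = [
    {"op": "insert", "key": 1, "values": {"customer": "Ana", "total": 40}},
    {"op": "insert", "key": 2, "values": {"customer": "Raj", "total": 15}},
    {"op": "update", "key": 1, "values": {"total": 55}},
    {"op": "delete", "key": 2},
]

analytical_copy = {}
apply_changes(analytical_copy, stream)
```

After the four changes are replayed in order, the analytical copy holds a single row for key `1` with the updated total. A real integration would also have to handle ordering, retries, and schema changes, which is why a built-in stream is attractive.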
Pick up the pieces
With a tip of the hat to the Average White Band, we’re pleased that Google is picking up the pieces from its data and analytics portfolio and starting to put them together. In this case, it extends the ability to granularly view, query, and manage permissions on data that would otherwise be buried in cloud object storage, a technology designed for economical storage but not for the finer aspects of access or management. Instead of reinventing the wheel, Google leveraged the capability it had already built for BigQuery.
Outside of the data tier, Google has made corresponding announcements about coupling Data Studio, Looker, and Connected Sheets. This wasn’t an example of adding new functionality, since users could already develop data visualizations in Looker; generate visualizations from Connected Sheets in Data Studio; or import data from Connected Sheets into Looker. But it added the missing links so Data Studio users could generate visualizations from LookML data models and Looker users could explore data from Connected Sheets. And last fall, Google announced access to serverless Spark without requiring customers to sign up for Dataproc.
The next step, of course, is to expand the reach of all of these capabilities, initially within the Google Cloud portfolio, but ultimately to a third-party ecosystem as well. Today, BigLake only supports a handful of data file formats, but they’re the obvious early targets, and Google isn’t done yet. Likewise, Dataplex, designed to make data discoverable and facilitate consistent governance, powers a subset of Google’s data and analytics services, including BigQuery, Dataproc, and Data Catalog; that leaves other Google data platforms and services like Firestore and Dataflow as logical next targets.
In addition, there is a need to build critical mass with third parties; obvious targets would be the databases that are part of Google’s open source data partnerships. And beyond open source databases, there are third-party providers where Google already plays a role, such as Collibra (in which Google Ventures has a stake) or Trifacta (which Google OEMs as Cloud Dataprep). Not to mention the broader landscape of third-party BI tools.
As Andrew Brust pointed out, this is all about what he calls the analytics stack redux: “On the BI side, the merging of native and acquired technologies is very reminiscent of what Microsoft, IBM, SAP and Oracle did in the 2000s.” The aim is to have solutions that can be put together.