Understanding Databricks ETL: A Quick Guide with Examples
Databricks provides high-performance, scalable tools for data storage, analysis, and management for both structured and unstructured data. Built as a cloud-native solution that pairs seamlessly with Microsoft Azure or Amazon Web Services (AWS), Databricks is an attractive option for companies looking to move their data management to the cloud while benefiting from sophisticated technical infrastructure, like built-in AI and machine learning tools, and interoperability with other cloud-hosted services.
Databricks is called a data lakehouse, which is a clever combination of the existing concepts of a data lake and a data warehouse. Data lakes provide storage for raw data, often in the form of blobs or files, rather than structured data like database tables. Data lakes impose fewer restrictions on data as it is ingested into the system, and utilize many sophisticated technical approaches to optimize data access within unstructured data. Data warehouses work primarily with structured data, similar to traditional databases, but provide additional data governance, access, and management tools on top of the data structures themselves.
Data lakehouses attempt to combine the best of both worlds, providing storage for both structured and unstructured data while including the higher-level management and analysis tools of a data warehouse.
As one of the premier data lakehouse platforms, Databricks is an attractive option for any organization moving its data storage to the cloud while looking to maintain the ability to manage both structured and unstructured data. To fully leverage the advantages of Databricks, these organizations need to develop an approach to Databricks ETL so that their data is appropriately stored and processed on the platform.
Understanding ETL
Like all data storage solutions, effective use of Databricks depends on the ability to successfully load data into the platform. Databricks serves as a central repository for data, but the process of aggregating and consolidating data in Databricks does not happen automatically. Part of the task of migrating data storage solutions is building a pipeline that connects existing data with the intended destination.
A simple way to understand this is by analogy to the process of buying a new house. While a new house may have a long list of benefits like more space, a better neighborhood, and higher-quality construction, it can only be fully utilized as a home once your existing belongings are moved from your current house to the new one. The process of packing up your belongings and hiring movers (or leaning on a good friend or two) mirrors the process that organizations must endure to migrate their data into their new data platform.
Only once an organization has a reliable data pipeline connecting myriad data sources with Databricks as the data destination can the true value of Databricks be realized.
The primary approach for this kind of data pipeline is called ETL (extract, transform, load). This describes a process by which data is extracted from the platform where it is generated (such as an accounting platform, ERP system, marketing automation hub, CRM application, etc.), transformed into a data model that meets an organization's needs, and then loaded into Databricks as the destination.
The result of an effective ETL pipeline is that data generated anywhere across an organization ends up in the intended data destination, Databricks, without manually copying or moving the data. Once the data has been moved into Databricks via an ETL pipeline, data analysts and other data professionals can access and process all their data in a single place. This enables efficient building of BI analytics processes, reporting and audits, and other data consumption operations all in one virtual location.
Understanding Databricks ETL
While ETL pipelines share conceptual similarities regardless of the data storage solution being used, understanding the specifics of these pipelines depends on understanding the idiosyncrasies of the data lake, warehouse, or lakehouse involved. The natural starting point is understanding a few specifics regarding Databricks’ architecture.
Databricks architecture
The layers of Databricks’ architecture, starting from the lowest and working upward, are as follows:
- Base-layer object storage (e.g., Azure Blob Storage, Amazon S3)
- Delta Lake virtual tables
- Delta Engine
The base object storage layer provides unstructured data storage, and typically relies on an existing blob storage solution like Azure Blob Storage or Amazon S3. This layer handles only the essentials of keeping data in a catalogued location, while imposing as few restrictions as possible on the size, structure, and type of data involved. A fundamental aspect of Databricks’ value is its ability to abstract away the details of this base-layer blob storage so that users do not even need to be aware of how data is stored at a low level. To abstract away the technical specifics of the storage layer, Databricks uses a protocol called Delta Lake to impose virtual structure onto the data.
Delta Lake extends the concept of unstructured data storage by using flexible file types (Parquet files) and a file-based transaction log to support data operations on the underlying unstructured data that are typically only supported for structured (e.g., tabular) data. Collectively, these guarantees about data operations are known as ACID (atomicity, consistency, isolation, and durability). While the technical details of how Delta Lake provides these guarantees are beyond the scope of this article, the upshot is that Delta Lake looks and feels more like a traditional database or data warehouse, despite being built on top of an unstructured storage layer more typically associated with a data lake.
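To make this concrete, the sketch below (Scala, run from a Databricks notebook where spark is already defined) shows a Delta table behaving like a database table even though it is backed by Parquet files and a transaction log in object storage. The table name and columns are illustrative placeholders, not values from any particular environment.

```scala
import io.delta.tables.DeltaTable
import spark.implicits._

// Writing a DataFrame in Delta format creates Parquet data files plus a _delta_log
// transaction log in the underlying object storage.
Seq((1, "alice"), (2, "bob"))
  .toDF("id", "name")
  .write.format("delta").mode("overwrite").saveAsTable("customers")

// An ACID upsert (MERGE): the whole operation either commits or rolls back as a unit,
// even though the data ultimately lives in plain files.
val updates = Seq((2, "bobby"), (3, "carol")).toDF("id", "name")
DeltaTable.forName(spark, "customers")
  .as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```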
The final layer of Databricks’ architecture is the Delta Engine, a query engine that leverages the structured operation support from Delta Lake to perform optimized data queries that mirror what would be expected in traditional data storage platforms. Stacking these architectural layers makes Databricks look and feel like a data warehouse while supporting the data demands satisfied by a data lake.
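Once a table like the hypothetical customers table above is registered, queries can be written in plain SQL, just as they would be against a warehouse:

```scala
// Query the Delta table with ordinary SQL; the engine optimizes the scan of the
// underlying Parquet files without the user managing any of those files directly.
spark.sql("SELECT name, COUNT(*) AS row_count FROM customers GROUP BY name").show()
```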
With this knowledge of Databricks’ architecture, it’s easier to examine the details involved in building Databricks ETL pipelines. The critical layer in the architecture is the Delta Lake layer, since this is the layer in which data gains the ‘shape’ it needs to be useful to data professionals. If an organization can build a data pipeline that migrates data into this Delta Lake layer, then the data is effectively available for management, analysis, and processing in Databricks.
Understanding Databricks ETL via Auto Loader
There are two methods of ingesting data into the Delta Lake layer:
- Databricks’ built-in tool, Auto Loader
- Third-party ETL tools like CData Sync
While Auto Loader may seem like the natural choice, it comes with a set of downsides that are worth considering for any organization migrating to Databricks. First, Auto Loader requires functional knowledge of Python or Scala and the Spark SDK, and is configured using custom scripting. One of the first things encountered on the documentation page for building a basic ETL pipeline is an example Scala script for configuring Auto Loader.
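For a sense of what that scripting involves, here is a minimal sketch of an Auto Loader configuration in Scala; the paths, file format, and table name are placeholders rather than values from any real environment, and spark is the notebook's built-in session.

```scala
import org.apache.spark.sql.streaming.Trigger

// Incrementally ingest raw files from a landing zone in object storage into a Delta table.
val raw = spark.readStream
  .format("cloudFiles")                                     // the Auto Loader source
  .option("cloudFiles.format", "json")                      // format of the incoming files
  .option("cloudFiles.schemaLocation", "/mnt/etl/_schema")  // where the inferred schema is tracked
  .load("/mnt/etl/landing")                                 // placeholder landing path

raw.writeStream
  .option("checkpointLocation", "/mnt/etl/_checkpoint")     // enables exactly-once ingestion
  .trigger(Trigger.AvailableNow())                          // process available files, then stop
  .toTable("bronze_events")                                 // placeholder Delta table name
```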
If your organization already has Scala expertise, and you prefer the granular control that custom scripting affords over your ETL operation, then Auto Loader may still seem attractive. But it’s also important to consider that Auto Loader was designed to integrate the base-layer blob storage of Databricks’ architecture with the structured Delta Lake layer. As a result, much of Auto Loader’s configuration and operation assumes by default that the data to be ingested into Databricks is unstructured.
However, much of the data captured elsewhere in an organization, such as by sales and marketing teams, has a natural structure. Accounts, Campaigns, Opportunities, and Ledgers all function as data objects with properties and relations to other objects, similar to database tables with columns and foreign keys. Thus, converting your organization’s data into unstructured data just so that Auto Loader can re-ingest it as structured data is a bit like taking apart your car before bringing it to a mechanic for repairs.
Understanding Databricks ETL via tools like CData Sync
The alternative to Auto Loader for Databricks ETL is a dedicated third-party tool like CData Sync. These solutions do not come bundled with Databricks, but are engineered to integrate seamlessly with a Databricks environment.
ETL tools operate under the default assumption that the data you need to migrate to Databricks is structured according to data objects like Accounts and Opportunities, so you don’t need to worry about disassembling and reassembling your data structures.
Dedicated ETL tools are typically designed with less technical users in mind, so while they accomplish the same data movement that the Auto Loader approach supports, they don’t tend to require custom scripting or advanced knowledge of the Databricks environment. Rather, third-party ETL tools tend to require a basic set of connection properties like a URL, an account identifier, and the name of the appropriate Databricks environment.
Simplification of the ETL process through third-party tools helps data engineers and data scientists maintain control over their own data pipelines without needing constant support from a programming or IT team. Thus, tools like CData Sync are often used when an organization wants to migrate to Databricks, but does not already have the technical resources in place that would make the transition easy to accomplish in a short timeframe.
How to Set Up Databricks ETL
The details of setting up Databricks ETL again depend on which of the two approaches is used:
- Databricks’ built-in tool, Auto Loader
- Third-party ETL tools like CData Sync
Using Auto Loader
The full process for establishing a pipeline using Databricks’ built-in ETL tools is as follows:
- Create a new resource of type Azure Databricks in your Azure portal
  - Log into your Azure portal, then navigate to Create Resource -> Analytics -> Azure Databricks
  - Assign a workspace name, resource group, location, and pricing tier to your new Databricks ETL resource
- Within your Databricks ETL resource, create a Spark cluster
  - Select Launch Workspace -> New Cluster
  - Assign values like a name, mode, etc. to the new cluster
- Create a Scala notebook for the new cluster where you will script the operations to ingest data into Databricks
  - Select Create Notebook and set Scala as the language
- Create a connection string within Scala using the parameters from your Databricks instance
  - Parameters include OAuth credentials like the App ID, Client Key, and Secret Key
- Read data from the specified files using the Spark SDK
  - E.g., methods like spark.read
- Transform the data using SQL-based transformation syntax within the Spark SDK
- Load the data into Databricks using the Spark SDK's write operations (see the sketch after this list)
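As a rough illustration of the scripted steps above, the following Scala sketch wires the pieces together; the storage account name, container, credentials, file paths, and table name are all hypothetical placeholders that would come from your own Azure and Databricks configuration.

```scala
// Configure OAuth access to the storage account (placeholder values throughout).
spark.conf.set("fs.azure.account.auth.type.mystorageacct.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mystorageacct.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.mystorageacct.dfs.core.windows.net", "<app-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.mystorageacct.dfs.core.windows.net", "<client-key>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mystorageacct.dfs.core.windows.net",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

// Extract: read raw files from the storage layer.
val raw = spark.read
  .option("header", "true")
  .csv("abfss://landing@mystorageacct.dfs.core.windows.net/sales/")

// Transform: apply SQL-based transformations via a temporary view.
raw.createOrReplaceTempView("raw_sales")
val cleaned = spark.sql(
  """SELECT CAST(order_id AS INT)  AS order_id,
    |       customer_name,
    |       CAST(amount AS DOUBLE) AS amount
    |FROM raw_sales
    |WHERE order_id IS NOT NULL""".stripMargin)

// Load: write the result into a Delta table managed by Databricks.
cleaned.write.format("delta").mode("overwrite").saveAsTable("sales_orders")
```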
Using a third-party ETL tool
The full process for establishing a pipeline with an external ETL tool may vary depending on the tool, but the overall approach is as follows:
- Set up a connection to Databricks
  - Provide basic connection parameters like a URL, App ID, etc.
- Set up connections to data sources like CRMs, ERPs, accounting systems, etc.
  - Again, provide basic connection parameters
- Choose the data objects within each data source that should be migrated to Databricks
- Choose an automation schedule according to which data is replicated from each source to Databricks
CData Sync for Databricks
CData Sync provides automated continuous ETL/ELT data replication from any on-premises or cloud-hosted data source directly into Databricks. Sync can be hosted within your network or in the private cloud that hosts your Databricks environment.
Sync strips away the complexity of building ETL pipelines using the native Databricks tools and instead relies on a simple combination of configured connections and replication jobs to ensure that your critical business data ends up in Databricks. No matter how many different platforms you use for customer management, accounting, ERP, marketing automation, or raw data storage, Sync can seamlessly pipe data from every source into Databricks. Sync also helps organizations bridge the cloud-premise gap by connecting to both local and cloud-hosted platforms without any extra hassle.
Leveraging the power of Databricks requires the ability to populate the platform with data from all of your data sources. The most common strategy is building an ETL pipeline, and the easiest way to build an ETL pipeline is via CData Sync.