Azure Databricks and Lambda Architecture

To get started with Microsoft Azure Databricks, log in to your Azure portal. If you do not have an Azure subscription, create a free account before you begin. Another great way to get started with Databricks is Community Edition, a free notebook environment with a micro-cluster.

Nathan Marz coined the term Lambda Architecture (LA) while working at BackType and Twitter. Its main goal is to describe a generic, scalable, and fault-tolerant data processing architecture. It is often used in social media systems that involve a stream of data being delivered in real time.

Batch processing, also called ETL (Extract, Transform, Load), performs the heavy lifting to update data warehouses for analysis and reporting. Traditional batch processes run nightly, a few times a day, or hourly. This has long been the core of Business Intelligence, but for many use cases this type of delay is unacceptable. When a report is incorrect and the source system is updated, the end users have to wait until the next batch is run before they see the correction. Lambda Architecture shortens the delay by adding a speed layer alongside the batch layer.

Based on the image above, new data (1) is sent to the system for processing. The data is sent to both the Speed and Batch layers. The Batch Layer (2) contains a master dataset that is immutable and append-only. The Serving Layer (3) is what the end users interact with for reporting or analysis purposes. Finally, the Speed Layer (4) is a view that contains streaming data and provides low-latency updates, covering the data that the most recent batch run has not yet processed.
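The flow between the layers can be sketched in a few lines of plain Python. This is only a toy in-memory model, not Databricks or Spark code, and every name in it (LambdaPipeline, ingest, run_batch, query) is made up for illustration. It assumes the job is a simple count of events per key.

```python
from collections import Counter


class LambdaPipeline:
    """Toy in-memory model of the batch, speed, and serving layers."""

    def __init__(self):
        # Batch layer: immutable, append-only master dataset.
        self.master_dataset = []
        # Batch view: counts precomputed by the most recent batch run.
        self.batch_view = Counter()
        # Speed view: counts for events that arrived since that run.
        self.speed_view = Counter()

    def ingest(self, event_key):
        """New data is sent to both the batch and the speed layer."""
        self.master_dataset.append(event_key)
        self.speed_view[event_key] += 1

    def run_batch(self):
        """Recompute the batch view from the full master dataset.

        After the run the batch view covers everything ingested so far,
        so the speed view can be reset.
        """
        self.batch_view = Counter(self.master_dataset)
        self.speed_view = Counter()

    def query(self, event_key):
        """Serving layer: merge the batch view with the speed view."""
        return self.batch_view[event_key] + self.speed_view[event_key]


pipeline = LambdaPipeline()
for key in ["login", "click", "click"]:
    pipeline.ingest(key)

# No batch has run yet, so the answer comes entirely from the speed view.
clicks_before_batch = pipeline.query("click")

pipeline.run_batch()
pipeline.ingest("click")

# Now the batch view and the speed view each contribute to the answer.
clicks_after_batch = pipeline.query("click")
```

The point of the sketch is that a query never has to wait for the next batch run: anything the batch view is missing is filled in by the speed view.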

Databricks uses Apache Spark as the core engine, and the streaming component is called Structured Streaming. Each stream is treated as an unbounded table to which new rows are continuously appended. This allows us to query the stream much as we would query a standard batch table, but the query runs incrementally and only picks up new data. When new data arrives, any aggregations are automatically updated for us. This differs from some other streaming engines, which require us to keep track of the aggregations ourselves.
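To make the unbounded table idea concrete, here is a small plain-Python model of an incrementally maintained aggregation. It is a conceptual sketch only and does not use the Spark API; the class and method names (IncrementalCount, process_new_rows) are hypothetical, and a Python list stands in for the stream.

```python
from collections import Counter


class IncrementalCount:
    """Keeps a running count per key, reading only rows it has not seen."""

    def __init__(self):
        self.totals = Counter()
        # Position in the table up to which rows were already processed.
        self.rows_processed = 0

    def process_new_rows(self, unbounded_table):
        """Fold the newly appended rows into the running totals."""
        new_rows = unbounded_table[self.rows_processed:]
        self.totals.update(new_rows)
        self.rows_processed = len(unbounded_table)
        return dict(self.totals)


# The "unbounded table": rows are only ever appended.
table = []
aggregation = IncrementalCount()

table.extend(["a", "b"])
first_result = aggregation.process_new_rows(table)   # {'a': 1, 'b': 1}

table.extend(["a"])
second_result = aggregation.process_new_rows(table)  # {'a': 2, 'b': 1}
```

In this model the bookkeeping (the running totals and the read position) lives inside the helper. In Structured Streaming that kind of state is managed by the engine, which is why the query itself can look like an ordinary batch query.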

While many may not see the need to include streaming technology in their systems, I see it as a nice addition to data warehousing. Being able to bridge the gap between new data and an end-user report helps businesses make decisions faster.
