Insight from paper: MotherDuck: DuckDB in the cloud and in the client

Hemant Gupta
6 min readAug 25, 2024

--

1. Abstract

MotherDuck is a service that connects DuckDB to the cloud.

If you are unaware of the DuckDB, Please read my post here.

MotherDuck provides the concept of hybrid query processing: the ability to execute queries partly on the client and partly in the cloud.

The paper describes the motivation for MotherDuck and some of its use cases.

MotherDuck allows existing DuckDB users to start using cloud computing without changing their queries.

It also allows users to share DuckDB databases with others through the cloud for collaboration.

2. Introduction

So, what does MotherDuck offer?

It offers DuckDB data storage and serverless query processing in the cloud.

DuckDB is an embedded analytical database system.

Before we proceed, let’s make it clear what is embedded here.

Embedded means that DuckDB runs inside the driver/API library used by the client process.

DuckDB is popular because:

  1. It has high performance.
  2. It is easy to use.
  3. It is tightly integrated with Python and R data science libraries.

Hybrid Query Processing:

MotherDuck allows customers to store DuckDB databases in the cloud. The cloud data can be in Parquet, CSV, or JSON files.

Users can simultaneously query local and cloud databases within the same query.

The MotherDuck optimizer will plan query operators to be executed close to where the data is.

It uses bridge operators to upload and download tuple streams between local and cloud. It is called hybrid query processing.

Reducing Cloud Footprint:

The team analyzed traces of cloud data warehouses. It found that >95% of databases are <1TB in size, and >95% of queries involve <10GB of data.

The current analytical cloud data platforms rely on scale-out architectures, which add system complexity but are optional by 95% of users.

MotherDuck advocates a serverless architecture that avoids scale-out, opting for the simplicity of a single-node architecture. The team has a slogan for it: Big Data is Dead.

Variating the number of cores and RAM given to a container provides elasticity for a single user.

Enabling New Applications:

Processing in both client and server opens up new applications since clients can process specific queries locally.

DuckDB can be compiled into Web Assembly (wasm).

An early demonstrator is the Mosaic framework for scalable, interactive data visualizations. DuckDB’s web interface also uses this new application architecture.

MotherDuck could also make applications more secure.

Scaling Existing Applications:

DuckDB is popular among data scientists for designing and evaluating ETL or analysis pipelines on a laptop.

Now, these scenarios can be enhanced by scheduling the designed pipelines to run in production in the cloud.

These production pipelines also benefit from MotherDuck’s ability to scale memory and CPU resources beyond a laptop.

Data scientists can also use the MotherDuck SHARE and ATTACH features to share databases in the cloud and collaborate in a team.

3. Background DuckDB

State-of-the-Art Architecture: DuckDB provides the following features:

  • It has columnar storage that is skippable.
  • It uses primary keys and foreign keys backed by the ART index.
  • Its columns can store nested datatypes.
  • Its database transactions are optimized for handling changes in bulk.
  • The MVCC is optimized for fast scans.
  • It uses a vectorized query execution engine that supports operating directly on lightweight-compressed data.
  • It has a LeanStore-inspired buffer manager for main memory query processing speeds in SSD.
  • It has hyperloglog powered statistics and cardinality estimation.
  • It has a rule-based optimizer using dynamic programming.
  • It supports push-based operator execution.

User-Friendly Features

  • It has friendly SQL language extensions.
  • It has rich timezone support and SQL macros.
  • It supports reading and writing for Parquet and Iceberg files.
  • It has “zero-copy” integration with dplyr, numpy, pandas, and Arrow.

Extension Modules

DuckDB can get new functionality via extension modules.

The system can be extended on several dimensions: data types, operators, optimizer rules, and even the parser.

MotherDuck takes advantage of the DuckDB extension mechanism. A signed (trusted) extension provides connectivity from DuckDB clients to MotherDuck in the cloud using hybrid query processing.

4.MotherDuck Architecture

High-Level Architecture

4.1 Infrastructure

MotherDuck is a SaaS. It runs a control plane with components responsible for many administrative tasks.

Compute

MotherDuck’s compute platform is built on top of shortlived, on-demand, allocated containers. Each container runs a single DuckDB instance.

The containers run in a VM that may have a local SSD, which the MotherDuck storage service uses for caching.

The amount of RAM and CPU resources allocated to a container can be changed.

MotherDuck shuts down containers completely when they are not used.

Storage

The storage layer is built on a shared, scale-out, distributed storage fabric provided by the Cloud vendor.

MotherDuck’s storage service provides client data isolation and improves performance by

  1. Leveraging local SSD resources for caching
  2. Adapting DuckDB database storage to better suit cloud storage systems

DuckDB’s native data format stores compressed columnar data in a single file.

The duckling storage extension allows for differential storage, where changed data is stored independently as a mutation tree.

It enables zero-copy duplication, sharing, branching, and time travel.

DuckDB provides multi-database storage. A single DuckDB can ATTACH multiple database files simultaneously in read-only or read-write mode.

MotherDuck allows multiple users to have read-only access to a database created and modified by another user.

4.2 Hybrid Query Processing

A DuckDB query goes through four phases -

Parsing, Binding, Query optimization, and Execution

The MotherDuck client extension module hooks into all four phases.

Order-aware Bridge Operators

The bridge operators download and upload tuple streams between the client and the cloud.

These operators aim to work well in situations where upload and download bandwidth is asymmetrical and handle the possibly different endianness from the client.

They can materialize tuples in the sink in the same order they were stored in the original pipeline source.

Remote-local Optimizer

The MotherDuck client extension adds a new rule to the DuckDB optimizer for planning hybrid query processing.

This rule splits the plan into fragments and designates each fragment to run locally or remotely, inserting bridge operators in between.

Virtual Catalog

Query planning is done locally, so local DuckDB should have access to metadata information about cloud-resident databases.

DuckDB’s extensible catalog uses a MotherDuck proxy catalog to create and maintain catalogs for all relevant cloud databases.

SQL Language Extensions

MotherDuck has changed the binding process of the table functions to read Parquet, JSON, and CSV files. Now, these can be used in the DuckDB’s FROM clause.

The MotherDuck client extension further registers a parser that adds support for new SQL statements.

The CREATE SHARE statement allows to share a cloud database with other MotherDuck users.

4.3 User Interface

MotherDuck comes with a built-in web-based user interface.

Interactive Result Set Exploration

An analytical task is like an iterative, user-driven optimization process for finding a suitable SQL query to answer an analytic question.

The user enters an SQL query, examines the query results, and then revises the query to improve its accuracy in answering the original analytic question.

The MotherDuck Web UI supports interactions by providing users with a notebook-style interface.

MotherDuckWeb GUI Example

Declarative Caching

When executing a query, the MotherDuck GUI accumulates the results in a local results cache.

CREATE TABLE localMemDb.main.cacheTable1 AS
SELECT * FROM ({userQuery})
LIMIT {CLIENT_CACHE_ROW_LIMIT}

DuckDB-wasm manages this in-memory table and runs in the user’s browser.

AI assistant

The MotherDuck GUI came with the possibility of formulating notebook queries in natural language.

In the background, MotherDuck AI will automatically generate an SQL statement, iteratively fix it if needed, and execute it against the current database.

5. Conclusion

MotherDuck connects the popular, lightweight, user-friendly open-source analytics database DuckDB to the cloud.

It empowers users to use more client-side resources. It simplifies cloud data system architecture by using a scale-up approach.

At the time of paper writing, MotherDuck has completed one year and is in beta launch.

References

Data Management for Data Science Towards Embedded Analytics

Scalability! But at what COST?

DuckDB-Wasm: Fast Analytical Processing for the Web

Fast serializable mvcc for main-memory database systems

--

--

Hemant Gupta

https://www.linkedin.com/in/hkgupta/ Working on the challenge to create insights for 100 software eng papers. #100PapersChallenge, AVP Engineering at CoinDCX