Insights from paper: Yellowbrick: An Elastic Data Warehouse on Kubernetes

Hemant Gupta
7 min read · Aug 25, 2024


1. Abstract

Yellowbrick is a database management system composed of a set of Kubernetes-orchestrated microservices.

The team has created an SQL interface for Kubernetes. It hides the details of the underlying microservices implementation from the end user.

The team also developed a reliable network protocol for efficient data exchange between nodes in the public cloud.

This paper provides an overview of Yellowbrick’s microservices approach to delivering elasticity, scalability, and separation of compute and storage.

The paper also describes the optimizations implemented in the operating system and in their software to drive efficiency and performance.

2. Introduction

Recently, Kubernetes has become the de facto standard orchestration framework for containerized microservices.

Several data warehouses advertise the ability to deploy on Kubernetes, but none of them is composed of fine-grained microservices.

They also lack an SQL interface for Kubernetes.

3. Overview of Yellowbrick

Yellowbrick is an ACID-compliant MPP SQL relational data warehouse. It is designed to deliver instant elasticity, scalability, performance, efficiency, high concurrency, and availability.

Yellowbrick consists of three major components:

  1. The data warehouse manager — This provides the control plane for provisioning multiple separate data warehouse instances.
  2. A data warehouse instance — It manages a set of databases.
  3. An elastically scalable compute cluster — It adds compute capacity to a data warehouse instance for different workloads.

Storage is separated from compute.

The data is persisted in object storage as column-oriented, compressed files known as shards.

Each compute node in the cluster has a locally attached NVMe (nonvolatile memory express) SSD-based shard cache. Each compute cluster can scale from 1 node to 64 nodes. Compute clusters can be configured to suspend and resume automatically.

Each compute cluster can see all databases managed by an instance. Users are assigned to one or more compute clusters to which queries are submitted.

If multiple active compute clusters exist, an intelligent load balancer automatically routes each query to the cluster that can complete it as quickly as possible.

The data warehouse manager is used to provision data warehouse instances. It provides a web-based user interface.

Yellowbrick runs in a customer’s cloud service provider account.

3.1 Microservices Architecture

Yellowbrick is composed of a set of microservices.

Collectively, all these microservices provide functionality for the database management system. They are shown in the diagram below.

Yellowbrick microservices

All these microservices are packaged as Linux container images.

Kubernetes provides container orchestration and resilience, ensuring the data warehouse is maintained in the desired state.

The data warehouse instance is the front-end microservice for the data warehouse. It manages client connections and performs query parsing, query plan caching, row storage, metadata management, transaction management, and more.

It is deployed as a singleton StatefulSet pod.

The data warehouse manager supports one or more data warehouse instances. It comprises pods providing UI, authentication, monitoring, configuration management, and workflow services.

Each compute node runs a single worker process deployed in a StatefulSet pod.

The worker is responsible for executing a portion of the query plan. New workers can be added or removed from a running compute cluster dynamically.

Yellowbrick built an SQL interface over Kubernetes to make managing compute clusters easy.

Using SQL, users can create, alter, suspend, resume, select, or destroy compute clusters.
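As a rough sketch of what driving this interface from a client program could look like, the snippet below sends two hypothetical cluster-management statements over a PostgreSQL-compatible connection using libpq. The paper only says the front end is derived from the PostgreSQL parser, so wire-protocol compatibility, the connection string, the cluster name, and the exact statement syntax are all assumptions here, not Yellowbrick’s documented DDL:

```cpp
// Hedged sketch: issuing cluster-management SQL through a PostgreSQL-compatible
// client library (libpq). Connection string, cluster name, and statement
// syntax are illustrative assumptions.
#include <libpq-fe.h>
#include <cstdio>

int main() {
    // Hypothetical connection parameters for a data warehouse instance.
    PGconn *conn = PQconnectdb("host=warehouse.example.com dbname=analytics user=admin");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s\n", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    // Illustrative statements: create an elastic compute cluster, then suspend it.
    const char *statements[] = {
        "CREATE CLUSTER etl_cluster WITH (NODES = 4)",
        "ALTER CLUSTER etl_cluster SUSPEND",
    };
    for (const char *sql : statements) {
        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            std::fprintf(stderr, "'%s' failed: %s\n", sql, PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}
```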

3.2 Deployment Approach

Yellowbrick deployments in AWS are bootstrapped using the AWS CloudFormation service. The service provisions a VPC, load balancers, subnets, security groups, and an Elastic Kubernetes Service cluster with autoscaling enabled.

After that, it starts the data warehouse manager services and creates the rest of the artifacts.

4. Software Optimizations

4.1 Database Optimizations

Yellowbrick’s query engine supports parallel query plans, cost-based optimization, workload management, and parallel query execution.

Query plans are translated to C++ code, compiled by the compiler microservice, and distributed to the workers for parallel execution.

The SQL parser and planner are based on the PostgreSQL parser.

The query planner has been significantly modified.

Yellowbrick supports hash, sort-merge, and loop joins, as well as SQL rewrites for pushing down, eliminating, inferring, and simplifying predicates and joins.

Yellowbrick is a shared-nothing database.

Workers address a portion of the underlying data. Rows are allocated to workers based on the hash value of a specified column.
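A minimal sketch of the idea, with an illustrative hash function (the paper does not specify Yellowbrick’s actual hashing or placement scheme):

```cpp
// Illustrative hash distribution: each row is assigned to a worker based on
// the hash of its distribution-column value. std::hash and the modulo
// placement stand in for whatever Yellowbrick actually uses.
#include <cstdint>
#include <functional>
#include <string>

uint32_t worker_for_row(const std::string &distribution_value, uint32_t num_workers) {
    std::size_t h = std::hash<std::string>{}(distribution_value);
    return static_cast<uint32_t>(h % num_workers);
}
```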

Workers have an execution engine and a storage engine.

The execution engine uses a credit-based flow control framework to govern the resources consumed by each query. It is shown in the diagram below:

Query execution

Query execution graph nodes are granted credits to process data. Credits flow downwards through the graph, and data packets flow upwards.

The execution engine runs an object code instantiation of a query plan generated by LLVM inside the compilation microservice.

At the start of query execution, every thread on every worker is granted one credit.

Graph nodes execute cooperatively and cannot be interrupted against their will.

Graph nodes can choose the data packet format they operate on, and transpose nodes are injected into the execution plan to rearrange data into that format.
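The sketch below illustrates the credit mechanism between two adjacent graph nodes: the parent grants credits down to its child, and the child may emit a data packet upward only while it holds a credit. The class names and single-threaded queue are illustrative, not Yellowbrick’s implementation:

```cpp
// Illustrative credit-based flow control between a parent (consumer) and its
// child (producer) in the query execution graph: credits flow down, data
// packets flow up, and the child emits only while it holds a credit.
#include <cstddef>
#include <queue>
#include <vector>

struct DataPacket {
    std::vector<char> bytes;                     // e.g. a 256 KB data packet
};

class ChildNode {
public:
    void grant_credit() { ++credits_; }          // called by the parent node above

    // Emit at most one packet per credit; refuse when no credit is held.
    bool try_emit(std::queue<DataPacket> &up_queue) {
        if (credits_ == 0) return false;         // back-pressure
        --credits_;
        up_queue.push(DataPacket{std::vector<char>(256 * 1024)});
        return true;
    }

private:
    std::size_t credits_ = 0;
};

class ParentNode {
public:
    explicit ParentNode(ChildNode &child) : child_(child) {
        child_.grant_credit();                   // initial credit at query start
    }

    void poll(std::queue<DataPacket> &up_queue) {
        while (!up_queue.empty()) {
            up_queue.pop();                      // process the packet (elided)
            child_.grant_credit();               // credit flows back down
        }
    }

private:
    ChildNode &child_;
};

int main() {
    ChildNode child;
    ParentNode parent(child);
    std::queue<DataPacket> up_queue;
    child.try_emit(up_queue);                    // succeeds: holds the initial credit
    parent.poll(up_queue);                       // consumes it and grants another credit
    child.try_emit(up_queue);                    // succeeds again
}
```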

The storage engine manages the column-oriented shard files.

It reads column-oriented shard data from the local NVMe cache over the PCIe bus, decompresses, transposes, and filters it using vectorized SIMD instructions, and then passes packets up the query execution graph.

Data packets are 256 KB and are designed to fit into the L3 cache.
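The snippet below sketches the flavor of such vectorized filtering using AVX2 intrinsics, comparing eight 32-bit values per instruction and collecting a bitmap of matches; the engine’s real filter code is generated per query and is not shown in the paper:

```cpp
// Illustrative SIMD predicate evaluation (compile with -mavx2): compare eight
// 32-bit values per instruction and collect one match bit per row.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> filter_gt(const int32_t *values, std::size_t count, int32_t threshold) {
    std::vector<uint8_t> match(count, 0);
    const __m256i thresh = _mm256_set1_epi32(threshold);
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(values + i));
        __m256i cmp  = _mm256_cmpgt_epi32(v, thresh);                // lane-wise v > threshold
        int     mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)); // one bit per lane
        for (int lane = 0; lane < 8; ++lane)
            match[i + lane] = (mask >> lane) & 1;
    }
    for (; i < count; ++i)                                           // scalar tail
        match[i] = values[i] > threshold;
    return match;
}
```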

The execution engine is fully multi-core and NUMA-aware.

The diagram below shows the life cycle of a query.

Query life cycle

Each query passes through several states.

Once submitted, a query runs to completion, is canceled, or fails with an error (the DONE, CANCEL, and ERROR states, respectively).

If a query is restarted or returns an error, it may re-enter the cycle in the ASSEMBLE state.

4.2 Operating System Optimizations

Yellowbrick bypasses the Linux kernel for most system-level operations.

The overall aim is to ensure that data read from NVMe SSDs is preserved in the CPU caches.

At start-up, the memory manager takes control of the system memory to avoid kernel swapping.

Memory allocations are grouped by query lifetime to avoid memory fragmentation. The memory allocator is initialized during the C++ worker’s initial setup.

The allocator’s memory is mapped as one contiguous virtual address region, and the mapping guarantees that the allocator only uses memory in 2 MB or 1 GB HugePage blocks.
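A minimal sketch of reserving a contiguous HugePage-backed region on Linux, in the spirit of what the allocator is described as doing (sizes, the 1 GB variant, and real error handling are simplified):

```cpp
// Illustrative reservation of a contiguous virtual region backed by 2 MB
// HugePages on Linux; the 1 GB variant and production error handling are omitted.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t region_size = 1ULL << 30;     // 1 GiB, a multiple of 2 MB
    void *region = mmap(nullptr, region_size,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,  // default HugePage size (2 MB)
                        -1, 0);
    if (region == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB) failed (are HugePages configured?)");
        return 1;
    }
    // An allocator would now hand out chunks from this region, grouped by
    // query lifetime to limit fragmentation.
    munmap(region, region_size);
    return 0;
}
```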

The team also implemented its own task scheduler, which runs in user space and is 500x faster than the regular Linux task scheduler.

Yellowbrick is a cooperative multitasking system. Time is divided into synchronized centisecond slots across a compute cluster.

Only one query is processed across the cluster during each time slot, and every CPU on every worker is entirely devoted to executing the current plan node for that query during the slot.
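The sketch below captures the slot-based cooperative idea on a single node: a task runs uninterrupted inside a 10 ms slot, checks the deadline between small units of work, and yields at the boundary. The cluster-wide slot synchronization and the actual scheduler are omitted:

```cpp
// Illustrative cooperative, slot-based scheduling on one node: a task runs
// uninterrupted within a 10 ms (centisecond) slot, checking the deadline
// between small units of work; cluster-wide synchronization is omitted.
#include <chrono>
#include <deque>
#include <functional>

using Clock = std::chrono::steady_clock;
constexpr std::chrono::milliseconds kSlot(10);      // one centisecond

struct QueryTask {
    std::function<bool()> run_one_unit;             // returns true while work remains
};

void run_slots(std::deque<QueryTask> &ready) {
    while (!ready.empty()) {
        QueryTask task = ready.front();
        ready.pop_front();
        const auto slot_end = Clock::now() + kSlot;

        bool more = true;
        while (more && Clock::now() < slot_end)     // never preempted, only yields
            more = task.run_one_unit();

        if (more)
            ready.push_back(task);                  // resume in a later slot
    }
}
```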

4.3 Networking Optimization using DPDK

Yellowbrick exchanges data between worker nodes using the Data Plane Development Kit (DPDK). Bypassing the kernel network stack and avoiding intermediate copies and system calls provides low latency and high bandwidth.

The team developed a network protocol based on UDP to provide reliable, ordered packet delivery and minimize CPU overhead.

DPDK is configured such that each vCPU thread on a worker connects to a corresponding vCPU thread on a different worker. Each thread has its own receive and transmit queues, which are polled asynchronously.
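A minimal sketch of such a per-thread polling loop using DPDK’s burst receive API; EAL arguments, port and queue configuration, and the reliable-UDP layer are omitted, and the port and queue ids are placeholders:

```cpp
// Illustrative per-thread DPDK polling loop using the burst receive API.
// Port/queue setup and the reliability/ordering protocol are omitted.
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void poll_queue(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[32];
    for (;;) {
        // Non-blocking poll of this thread's dedicated receive queue.
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, 32);
        for (uint16_t i = 0; i < n; ++i) {
            // A real implementation would run the reliability/ordering logic
            // here before handing data to the execution engine.
            rte_pktmbuf_free(bufs[i]);
        }
    }
}

int main(int argc, char **argv) {
    if (rte_eal_init(argc, argv) < 0)   // initialize DPDK's environment abstraction layer
        return 1;
    poll_queue(/*port_id=*/0, /*queue_id=*/0);
    return 0;
}
```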

To measure performance, the team executed a benchmark using the industry-standard TPC-DS workload on Yellowbrick running in AWS. The results are shown in the diagram below:

Sequential runtime of the 99 TPC-DS queries at 1 TB scale

4.4 Storage Optimizations

Yellowbrick’s hybrid storage engine design combines a front-end row store with a back-end column store. The data warehouse instance microservice manages the row store.

Data can be inserted into the row store at high speed on a record-by-record basis and is instantly accessible. Rows are automatically flushed into the column store over time.

Bulk loads are inserted directly into the column store via parallel connections to the workers, bypassing the row store.

ACID properties are preserved across the row and column stores using a common transaction log with a read-committed level of isolation and multi-version concurrency control.

Shard files are immutable.

Deleted records are tracked in side files, which contain bitmaps that mask the deleted rows in their respective shards.
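A minimal sketch of applying a side-file bitmap while scanning an immutable shard (the structures and names are illustrative, not the on-disk format):

```cpp
// Illustrative delete masking: bit i of the side-file bitmap is set when row i
// of the immutable shard has been deleted.
#include <cstddef>
#include <cstdint>
#include <vector>

struct DeleteBitmap {
    std::vector<uint64_t> words;                    // loaded from the shard's side file

    bool is_deleted(std::size_t row) const {
        std::size_t word = row / 64, bit = row % 64;
        return word < words.size() && ((words[word] >> bit) & 1);
    }
};

// Visit only the rows of a shard that are still live.
template <typename RowFn>
void scan_live_rows(std::size_t row_count, const DeleteBitmap &deleted, RowFn &&fn) {
    for (std::size_t row = 0; row < row_count; ++row)
        if (!deleted.is_deleted(row))
            fn(row);
}
```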

Workers read data in 256 KB blocks from the object store and cache them locally on NVMe SSDs.

Shard files are ~100 MB and are transactionally written to the object store in 2 MB blocks.
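A minimal sketch of block-granular caching along these lines: a shard is addressed in 256 KB blocks, with a miss read through to the object store and the block kept locally afterwards; the key layout and interfaces are assumptions:

```cpp
// Illustrative block-granular shard cache: reads are addressed in 256 KB
// blocks, fetched from the object store on a miss and kept locally after.
// The in-memory map stands in for the NVMe cache; the fetch is a placeholder.
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

constexpr std::size_t kBlockSize = 256 * 1024;      // 256 KB cache/read unit

class ShardCache {
public:
    // Return the block containing byte_offset, reading through on a miss.
    const std::vector<char> &get(const std::string &shard_id, uint64_t byte_offset) {
        const std::string key = shard_id + "#" + std::to_string(byte_offset / kBlockSize);
        auto it = blocks_.find(key);
        if (it == blocks_.end()) {
            // Placeholder for an object-store range read of exactly one block.
            it = blocks_.emplace(key, std::vector<char>(kBlockSize)).first;
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, std::vector<char>> blocks_;
};
```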

The team has implemented its own C++ S3 connection library to support deployments on AWS.

5. Conclusion and Future Work

Yellowbrick uses Kubernetes as the orchestration engine. It enables Yellowbrick to run anywhere.

Kubernetes manages the lifecycle of the data warehouse, providing elasticity, availability, and scalability.

This lets the team focus on the core business of enhancing database performance and adding new features.

The optimizations implemented to reduce OS kernel overhead in Yellowbrick contribute significant performance benefits.

The custom network protocol based on DPDK for exchanging data between MPP nodes alone reduces the runtime of some queries by as much as 70% in the public cloud.

The team is investigating the impact of compression on networking performance and evaluating the performance impact of offloading the compression overhead to the network interface card.



Hemant Gupta

https://www.linkedin.com/in/hkgupta/ Working on the challenge to create insights for 100 software eng papers. #100PapersChallenge, AVP Engineering at CoinDCX