Insights from paper: FoundationDB: A Distributed Unbundled Transactional Key-Value Store

Hemant Gupta
14 min read · Aug 27, 2024


1. Abstract

FoundationDB is an open-source transactional key-value store. It was created around 2009.

It is one of the first systems to combine the flexibility and scalability of NoSQL architectures with the power of ACID transactions.

FoundationDB adopts an unbundled architecture. It decouples three components:

  • An in-memory transaction management system (Data Plane)
  • A distributed storage system (Data Plane)
  • A built-in distributed configuration system (Control Plane).

All three components can be independently scaled out.

FoundationDB(FDB) integrates a deterministic simulation framework to test every new system feature.

FDB has implemented a minimal feature set that can be used to build semi-relational databases, document stores, object stores, and graph databases.

FDB powers cloud infrastructure at Apple, Snowflake, and other companies.

It provides consistency, robustness, and availability.

2. Introduction

A decade ago, NoSQL storage systems emerged. They offered ease of application development, simple scaling and operation of storage systems, fault tolerance, and support for a wide range of data models.

These systems sacrificed transactional semantics and instead provided eventual consistency.

FDB avoids this trade-off by providing serializable transactions while scaling to handle large workloads.

It also implements more advanced features, such as consistent secondary indices and referential integrity checks.

FDB is an ordered, transactional, key-value store natively supporting multi-key strictly serializable transactions across its entire key space.

One more thing: the development of FDB took a radical approach. Before building the database itself, the team built a deterministic database simulation framework.

This framework can simulate a network of interacting processes and a variety of disk, process, network, and request-level failures and recoveries, all within a single physical process.

FDB achieves strict serializability by combining optimistic concurrency control (OCC) and multi-version concurrency control (MVCC).

FDB does not rely on quorums to mask failures; instead, it tries to detect and recover from them eagerly by reconfiguring the system.

3. Design

A database system has many tasks, such as data persistence, data partitioning, load balancing, membership and failure detection, failure recovery, replica placement and synchronization, overload control, scaling, concurrency, job scheduling, system monitoring and alerting, backup, multi-language client library, system upgrade and deployment, and configuration management.

It is too much to cover, so this paper focuses on FDB’s architectural design and its implications for transaction management, replication, and fault tolerance.

3.1 Design Principles

Divide-and-Conquer: FDB decouples the transaction management system (write path) from the distributed storage (read path) and scales them independently.

Within the transaction management system, processes are assigned various roles representing different aspects of transaction management, including timestamp management, accepting commits, conflict detection, and logging.

Make failure a common case: In distributed systems, failures are the norm rather than the exception, so FDB is designed to handle them.

For example, the transaction system proactively shuts down when it detects a failure, reducing failures to a single recovery operation.

Fail fast and recover fast: To improve availability, FDB strives to minimize Mean-Time-To-Recovery (MTTR).

Simulation testing: FDB relies on a randomized, deterministic simulation framework to test its distributed database’s correctness.

3.2 System Interface

FDB exposes operations to read and modify single keys and ranges of keys.

  • get() — It reads a key.
  • set() — It writes a single key-value pair.
  • getRange() — It returns a sorted list of keys and their values within the given range.
  • clear() — It deletes all key-value pairs within a range or starting with a certain key prefix.

A transaction’s writes are buffered by the FDB client until the final commit() call.

Key and value sizes are limited to 10 KB and 100 KB, respectively, for better performance. The transaction size is limited to 10 MB.
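
To make the interface concrete, here is a minimal sketch using the official Python binding. The API version, key names, and key layout are assumptions for this example; it also assumes a reachable cluster via the default cluster file.

```python
import fdb

fdb.api_version(630)   # any API version supported by the installed client
db = fdb.open()        # connects using the default cluster file

@fdb.transactional
def update_profile(tr):
    # set(): writes are buffered on the client until the transaction commits
    tr[b'user/42/name'] = b'Ada'
    tr[b'user/42/city'] = b'Pune'
    # get(): reads at the transaction's read version
    name = tr[b'user/42/name']
    # getRange(): key-value pairs in key order within [begin, end)
    profile = list(tr.get_range(b'user/42/', b'user/42/\xff'))
    # clear(): remove a whole range of keys
    tr.clear_range(b'user/42/tmp/', b'user/42/tmp/\xff')
    return name, profile

# The decorator creates the transaction, commits it, and retries on conflicts.
print(update_profile(db))
```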

3.3 Architecture

An FDB cluster has a control plane for managing critical system metadata and cluster-wide orchestration and a data plane for transaction processing and data storage. The diagram below shows the high-level architecture.

The architecture and the transaction processing of FDB

3.3.1 Control Plane

The control plane is responsible for persisting critical system metadata, such as the configuration of the transaction system, on Coordinators.

The Coordinators form a disk Paxos group and select a singleton ClusterController. The ClusterController monitors all servers in the cluster and recruits three singleton processes: Sequencer, DataDistributor, and Ratekeeper.

The DataDistributor is responsible for monitoring failures and balancing data among StorageServers.

The Ratekeeper provides overload protection for the cluster.

3.3.2 Data Plane

FDB is designed to target OLTP workloads. These workloads are read-mostly, read and write a small set of keys, have low contention, and require scalability.

As discussed previously, FDB chooses an unbundled architecture.

  1. A distributed transaction management system (TS) performs in-memory transaction processing.
  2. A log system (LS) stores the Write-Ahead Log (WAL) for the TS.
  3. A distributed storage system (SS) stores data and serves reads.

The TS provides transaction processing and comprises a Sequencer, Proxies, and Resolvers. All these are stateless processes.

The LS contains a set of LogServers.

The SS has several StorageServers.

The Sequencer assigns a read version and a commit version to each transaction and recruits Proxy Servers, Resolvers, and Log Servers.

The Proxies offer MVCC read versions to clients and orchestrate transaction commits.

Resolvers check for conflicts between transactions.

LogServers act as replicated, sharded, distributed persistent queues, where each queue stores WAL data for a StorageServer.

StorageServers serve client reads. Each StorageServer stores a set of data shards (contiguous key ranges).

At the time the paper was written, the storage engine on each StorageServer was a modified version of SQLite.

3.3.3 Read-Write Separation and Scaling

In FDB, processes are assigned different roles, such as coordinators, storage servers, and sequencers.

The database scales by expanding the number of processes for each role.

Clients issue reads directly to the sharded StorageServers, so reads scale linearly with the number of StorageServers.

Writes are scaled by adding more Proxies, Resolvers, and LogServers.

3.3.4 Bootstrapping

FDB has no external dependency on other services.

All data and most of the metadata are stored in StorageServers. The metadata about StorageServers is persisted in LogServers. The LS configuration is stored in all Coordinators.

The newly elected ClusterController recruits a new Sequencer, which reads the configuration of the old LS stored in Coordinators and spawns a new TS and LS.

The Sequencer waits until the new TS finishes recovering. Then, the new LS configuration is written to all Coordinators. At this point, the new transaction system is ready.

3.3.5 Reconfiguration

FDB has a reconfiguration process. It restores the transaction management system to a new (clean) state after a TS or LS failure or a database configuration change.

The Sequencer process monitors the health of Proxies, Resolvers, and LogServers.

If any monitored processes fail or the database configuration changes, the Sequencer process terminates. The ClusterController detects the Sequencer failure event and recruits a new Sequencer.

The new Sequencer follows the above-mentioned bootstrapping process.

3.4 Transaction Management

3.4.1 End-to-end Transaction Processing

A client transaction starts by contacting one of the Proxies to obtain a read version (timestamp).

The Proxy asks the Sequencer for a read version. This version is guaranteed to be at least as large as any previously issued transaction commit version.

The client may issue multiple reads to StorageServers and obtain values at that specific read version.

Client writes are buffered locally without contacting the cluster. At commit time, the client sends the transaction data to one of the Proxies and waits for a response.

A Proxy commits a client transaction in three steps:

  1. The Proxy contacts the Sequencer to obtain a commit version larger than any existing read or commit versions. The Sequencer chooses the commit version by advancing it at a rate of one million versions per second.
  2. The Proxy sends the transaction information to range-partitioned Resolvers. Resolvers check for read-write conflicts (optimistic concurrency control, OCC). The transaction can proceed to the final commit stage if all Resolvers return without conflict.
  3. The committed transactions are sent to a set of LogServers for persistence. A transaction is considered committed after all designated LogServers have replied to the Proxy. In parallel, StorageServers continuously pull mutation logs from LogServers and apply committed updates to disk.

FDB also supports read-only transactions and snapshot reads.
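
To tie the three steps together, here is a simplified, non-authoritative Python sketch of the Proxy's commit path. The object and method names (next_commit_version, check, append) are illustrative, not FDB's actual internal interfaces.

```python
def proxy_commit(tx, sequencer, resolvers, log_servers):
    """Illustrative sketch of the Proxy commit path (Section 3.4.1)."""
    # Step 1: obtain a commit version larger than all existing read/commit versions.
    commit_version, prev_version = sequencer.next_commit_version()

    # Step 2: ask the range-partitioned Resolvers covering the transaction's keys
    # whether its reads conflict with writes committed after its read version (OCC).
    if not all(r.check(tx.read_set, tx.write_set, tx.read_version, commit_version)
               for r in resolvers):
        return "aborted"      # the client retries with a fresh read version

    # Step 3: make the mutations durable; the transaction is committed only after
    # all designated LogServers have acknowledged the append.
    acks = [ls.append(commit_version, prev_version, tx.write_set)
            for ls in log_servers]
    return "committed" if all(acks) else "waiting for LogServer acks"
```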

3.4.2 Support Strict Serializability

FDB implements Serializable Snapshot Isolation (SSI) by combining OCC with MVCC.

A transaction 𝑇𝑥 gets both its read version and commit version from the Sequencer. The commit version defines a serial transaction history and serves as the Log Sequence Number (LSN).

𝑇𝑥 observes the results of all previously committed transactions, which is how FDB achieves strict serializability.

A Proxy sends both the LSN and the previous LSN to Resolvers and LogServers so that they can serially process transactions in the order of LSNs.

The diagram above shows the lock-free conflict detection algorithm on Resolvers.

Each Resolver maintains a map 𝑙𝑎𝑠𝑡𝐶𝑜𝑚𝑚𝑖𝑡 of recently modified key ranges by committed transactions and their corresponding commit versions.

The commit request for 𝑇𝑥 comprises two sets: a set of modified key ranges 𝑅𝑤 and a set of read key ranges 𝑅𝑟.

The read set is checked against the modified key ranges of concurrent committed transactions, as shown in lines 1 to 5.

If there are no read-write conflicts, the Resolvers admit the transaction and update 𝑙𝑎𝑠𝑡𝐶𝑜𝑚𝑚𝑖𝑡 with its write set, as shown in lines 6 and 7.

The 𝑙𝑎𝑠𝑡𝐶𝑜𝑚𝑚𝑖𝑡 map is represented as a version-augmented probabilistic SkipList.

The entire key space is divided among Resolvers, so the above read-write conflict detection algorithm may be performed in parallel. The key ranges of Resolvers are dynamically adjusted to balance their loads.
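
A rough Python sketch of this check, assuming 𝑙𝑎𝑠𝑡𝐶𝑜𝑚𝑚𝑖𝑡 is kept as a simple list of (key range, commit version) pairs; the real implementation uses the version-augmented SkipList mentioned above.

```python
class Resolver:
    """Illustrative lock-free read-write conflict check (Section 3.4.2)."""

    def __init__(self):
        # (key_range, commit_version) pairs for writes committed inside the
        # MVCC window; ranges are half-open (begin, end) pairs of keys.
        self.last_commit = []

    def check(self, read_ranges, write_ranges, read_version, commit_version):
        # Lines 1-5: abort if any read range overlaps a range modified by a
        # transaction that committed after this transaction's read version.
        for r in read_ranges:
            for committed_range, version in self.last_commit:
                if version > read_version and overlaps(r, committed_range):
                    return False                      # read-write conflict
        # Lines 6-7: admit the transaction and record its modified ranges.
        for w in write_ranges:
            self.last_commit.append((w, commit_version))
        return True

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]
```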

A transaction aborted by one Resolver may still have been admitted by other Resolvers, which have already updated their 𝑙𝑎𝑠𝑡𝐶𝑜𝑚𝑚𝑖𝑡 history; this can cause false conflicts for later transactions. In practice, this has not been an issue for production workloads because transactions’ key ranges usually fall within a single Resolver.

The team’s micro-benchmark shows that one single-threaded Resolver can easily handle 280K TPS.

3.4.3 Logging Protocol

We have seen that once the Proxy decides to commit a transaction, the log messages are broadcast to all LogServers (step 3).

The Proxy first consults its in-memory shard map to determine the StorageServers responsible for the modified key range.

The Proxy attaches StorageServer tags to the mutation, where each tag has a preferred LogServer for storage. It is shown in the diagram below:

The mutation is only sent to the preferred LogServers (1 and 4). To meet the replication requirements, an additional LogServer 3 is added.
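
A hedged sketch of how the Proxy picks LogServers for one mutation, following the description above; the shard map and tag structures are assumptions made for illustration.

```python
def log_servers_for_mutation(key, shard_map, preferred_log_server, all_log_servers, k):
    """Illustrative sketch of picking LogServers for a mutation (Section 3.4.3).

    shard_map: maps a key to the StorageServer tags responsible for it
    preferred_log_server: maps a StorageServer tag to its preferred LogServer
    k: replication degree required for the log system
    """
    tags = shard_map(key)                               # StorageServers owning the key
    targets = {preferred_log_server[t] for t in tags}   # their preferred LogServers
    # Add extra LogServers until the replication requirement is met, like
    # LogServer 3 in the paper's example alongside the preferred 1 and 4.
    for ls in all_log_servers:
        if len(targets) >= k:
            break
        targets.add(ls)
    return targets
```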

The log message header includes both the LSN and the previous LSN obtained from the Sequencer. The message also has the Proxy’s known committed version (KCV).

LogServers reply to the Proxy once the log data is made durable. The Proxy updates its KCV to this LSN if all replica LogServers have replied and the LSN is larger than its current KCV.

As noted in step 3, the redo logs are continuously pulled from the LS to the SS; this happens in the background and is not part of the commit path.

The diagram below shows the lag from LogServers to StorageServers over a 12-hour period in one of the paper’s production clusters.

The lag from LogServers to StorageServers

3.4.4 Transaction System Recovery

Most database systems use the ARIES protocol for recovery. The protocol depends on a write-ahead log (WAL) and periodic, coarse-grained checkpoints.

During the recovery, the system processes redo log records from the last checkpoint by re-applying them to the relevant data pages.

In FDB, recovery is very cheap: there is no checkpointing, and no redo or undo log processing is needed during recovery.

StorageServers always pull logs from LogServers and apply them in the background, which essentially decouples redo log processing from the recovery.

The recovery process starts when a failure is detected, spawns a new transaction system, and ends when the old LogServers are no longer needed.

The Sequencer executes recovery in several steps:

  1. The Sequencer reads the previous transaction system states from the Coordinators and locks the coordinated states.
  2. The Sequencer recovers previous transaction system states, including the information about all older LogServers, stops these LogServers from accepting transactions, and recruits a new set of Proxies, Resolvers, and LogServers.
  3. Once previous LogServers are stopped and a new transaction system is recruited, the Sequencer writes the coordinated states with current transaction system information.
  4. The Sequencer accepts new transaction commits.

In essence, the system has to determine the end of the redo log in the old LogServers. This version is called the Recovery Version (RV).

Rolling back the undo log is essentially discarding any data after RV in the old LogServers and StorageServers.

Each LogServer keeps the maximum KCV received and a Durable Version (DV), which is the maximum persisted LSN.

During recovery, the Sequencer attempts to stop all 𝑚 old LogServers. Let us say the replication degree for LogServers is 𝑘.

Once the Sequencer has received more than 𝑚 − 𝑘 replies, the Sequencer knows the previous epoch has committed transactions up to the maximum of all KCVs. This maximum value becomes the previous epoch’s end version (PEV).

All data before this version has been fully replicated.

The start version for the current epoch is now 𝑃𝐸𝑉 + 1. The Sequencer chooses the minimum of all DVs to be the RV. Logs in the range of [𝑃𝐸𝑉 + 1, 𝑅𝑉 ] are copied from the previous epoch’s LogServers to the current ones.

The overhead of copying this range is minimal because it only contains a few seconds of log data.

The diagram below shows how the Sequencer determines RV.

On the left, a Proxy sends redo logs to LogServers with a 𝐾𝐶𝑉 and the 𝐿𝑆𝑁. On the right, recovery uses the maximum of the 𝐾𝐶𝑉s and the minimum of the 𝐷𝑉s on a set of LogServers as the 𝑃𝐸𝑉 and 𝑅𝑉, respectively.
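
A small sketch of this computation, assuming each stopped LogServer replies with its (KCV, DV) pair; the reply format is an assumption.

```python
def compute_recovery_versions(replies, m, k):
    """Derive PEV and RV from old LogServers' stop replies (Section 3.4.4).

    replies: list of (kcv, dv) pairs from old LogServers that acknowledged the
    stop request; m is the number of old LogServers, k their replication degree.
    The Sequencer needs more than m - k replies before proceeding."""
    if len(replies) <= m - k:
        raise RuntimeError("not enough LogServers have replied; keep waiting")
    pev = max(kcv for kcv, _ in replies)   # previous epoch's end version
    rv = min(dv for _, dv in replies)      # recovery version
    # The new epoch starts at pev + 1; logs in [pev + 1, rv] are copied from
    # the old LogServers to the new ones, and data after rv is discarded.
    return pev, rv

# Example: m = 3 old LogServers with k = 2, and three (KCV, DV) replies.
print(compute_recovery_versions([(100, 107), (103, 105), (102, 110)], m=3, k=2))
# -> (103, 105): PEV is the maximum KCV, RV is the minimum DV
```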

3.5 Replication

FDB uses different replication strategies for different kinds of data:

  1. Metadata replication — The control plane’s system metadata is stored on Coordinators using Active Disk Paxos. This metadata is recoverable as long as a quorum of Coordinators is alive.
  2. Log replication — Each sharded log record is synchronously replicated on 𝑘 = 𝑓 + 1 LogServers. The Proxy returns the commit response to the client only when all 𝑘 writes succeed.
  3. Storage replication — Every shard (a key range) is asynchronously replicated to 𝑘 = 𝑓 + 1 StorageServers, which together form a team. A failure of a StorageServer triggers the DataDistributor to move data from teams containing the failed process to other healthy teams.

3.6 Other Optimizations

3.6.1 Transaction batching

The Proxy can group multiple transactions received from clients into one batch.

It then asks the Sequencer for a single commit version for the whole batch and sends the batch to Resolvers for conflict detection.

The Proxy writes committed transactions in the batch to LogServers.

Transaction batching:

  1. Reduces the number of calls to the Sequencer to obtain a commit version.
  2. Allows Proxies to commit tens of thousands of transactions per second without significantly impacting the Sequencer’s performance.
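
A minimal sketch of the batching idea, reusing the illustrative sequencer/resolver/LogServer objects from the earlier commit-path sketch; batch composition and sizes are assumptions.

```python
def commit_batch(batch, sequencer, resolvers, log_servers):
    """Illustrative sketch: amortize one commit version over many transactions."""
    # One Sequencer call serves the whole batch instead of one call per transaction.
    commit_version, prev_version = sequencer.next_commit_version()

    # Conflict detection is still performed per transaction.
    committed = [tx for tx in batch
                 if all(r.check(tx.read_set, tx.write_set,
                                tx.read_version, commit_version)
                        for r in resolvers)]

    # Only the non-conflicting transactions' mutations are persisted.
    mutations = [m for tx in committed for m in tx.write_set]
    for ls in log_servers:
        ls.append(commit_version, prev_version, mutations)
    return committed
```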

3.6.2 Atomic operations

FDB supports atomic operations.

Some examples are atomic add, bitwise AND, compare-and-clear, and set-versionstamp.

These atomic operations enable a transaction to write a data item without reading its value.

Atomic operations also eliminate read-write conflicts with other atomic operations on the same data item.
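
As an illustration with the Python binding, an atomic add lets many clients bump a counter without reading it first, so their transactions do not conflict with one another. The counter key and the 64-bit little-endian encoding are assumptions of this example.

```python
import struct
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def bump_counter(tr, key=b'stats/page_views'):
    # The mutation carries the operand rather than the final value, so the
    # transaction neither reads the key nor conflicts with other atomic adds.
    tr.add(key, struct.pack('<q', 1))

bump_counter(db)
```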

4. Geo-replication and failover

Synchronous cross-region replication provides strong consistency but it comes with the cost of high latency.

If asynchronous replication is used instead, latency is reduced, but committed data is initially persisted only in the primary region.

FDB can be configured to perform either synchronous or asynchronous cross-region replication.

FDB has a third possibility that leverages multiple availability zones within the same region.

The diagram below shows the layout of a two-region replication of a cluster. Both regions have a data center (DC) as well as one or more satellite sites.

The satellites are located in close proximity to the DC in the same region but are failure-independent.

The satellites store only log replicas.

Control plane replicas are deployed across three or more failure domains, usually with at least 9 replicas. Relying on majority quorums allows the control plane to tolerate one site failure plus one additional replica failure.

In the diagram, one of the data centers (DC1) is configured with a higher priority than DC2. DC1 is the primary and DC2 is the secondary; DC2 does not have a TS.

Reads can be served from storage replicas at both primary and secondary data centers.

Writes are forwarded to the primary region and processed by Proxies in DC1, then synchronously persisted onto LogServers in DC1 and one or both satellite sites in the primary region.

The updates are then asynchronously replicated to DC2, where they are stored on multiple LogServers and eventually spread out to multiple StorageServers.

LogRouters are a special FDB role that facilitates cross-region data transfer.

They were created to avoid redundant cross-region transfers of the same information: LogRouters transfer each log entry across the WAN only once and then deliver it to all relevant LogServers locally in DC2.

The cluster automatically fails over to the secondary region if the primary data center becomes unavailable.

Satellite configuration can be specified per region. Each satellite is given a static priority, which is considered relative to the other satellites in the same region.

When DC1 in the primary region suddenly becomes unavailable, the cluster detects the failure and starts a new transaction management system in DC2.

New LogServers are recruited from satellites in the secondary region. During recovery, LogRouters in DC2 may need to fetch the last few seconds’ data from the primary satellites.

After the recovery, if the failures in Region 1 are healed and its replication policy can again be met, the cluster will automatically fail-back to have DC1 as the primary data center due to its higher priority.
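
A purely illustrative sketch of the priority rule behind fail-over and fail-back described above; the data structure and field names are assumptions.

```python
def choose_primary(regions):
    """Pick the primary data center: the highest-priority DC that is healthy
    and whose region can meet its replication policy (Section 4)."""
    candidates = [r for r in regions
                  if r["healthy"] and r["replication_policy_met"]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["priority"])["dc"]

# DC1 (priority 1) fails, so DC2 (priority 0) becomes primary; once Region 1
# heals and its replication policy can be met again, DC1 takes over once more.
regions = [
    {"dc": "DC1", "priority": 1, "healthy": False, "replication_policy_met": False},
    {"dc": "DC2", "priority": 0, "healthy": True, "replication_policy_met": True},
]
print(choose_primary(regions))   # -> DC2 while DC1 is down
```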

5. Lessons Learned

Architecture Design

The divide-and-conquer design principle has proven to be an enabling force for flexible cloud deployment.

Simulation Testing

Simulation testing has enabled FDB to maintain a very high development velocity with a small team.

Fast Recovery

Fast recovery is not only useful for improving availability, but also greatly simplifies the software upgrades and configuration changes and makes them faster.

5s MVCC Window

FDB chooses a 5-second MVCC window to limit the memory usage of the transaction system and storage servers.

In the team’s experience, a 5-second window is long enough for the majority of OLTP use cases.
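
Combining this with the Sequencer’s rate from Section 3.4.1, a back-of-the-envelope check of how many versions the window spans:

```python
VERSIONS_PER_SECOND = 1_000_000   # the Sequencer advances ~1M versions per second
MVCC_WINDOW_SECONDS = 5           # the chosen MVCC window
window_versions = VERSIONS_PER_SECOND * MVCC_WINDOW_SECONDS
print(window_versions)            # 5,000,000 versions of recent history kept in memory
```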

6. Related Work

Key/value stores and NoSQL systems

BigTable, Dynamo, PNUTS, MongoDB, CouchDB and Cassandra do not provide ACID transactions.

RAMCloud is an in-memory key value system that only supports single-object transactions.

Google’s Percolator, Apache Tephra, and Omid layer transactional APIs on top of key-value stores with snapshot isolation.

Concurrency Control

Many systems use the time of acquiring all locks to establish the serial order among transactions and to guarantee atomicity and isolation.

Spanner and CockroachDB, for example, do this.

A few other systems order transactions without locks.

H-Store, Calvin, Hekaton, and Omid execute transactions in timestamp order.

Unbundled database systems

Some databases separate the transaction component (TC) from the data component (DC).

Deuteronomy creates virtual resources that can be logically locked in the transaction component, while the data component knows nothing about transactions or their commit or abort.

Solar combines the benefits of scalable storage on a cluster of nodes with a single server for transaction processing.

Amazon Aurora simplifies database replication and recovery with shared storage.

Bundled database systems

Silo and Hekaton have achieved high throughput using a single server for transaction processing.

Many distributed databases partition data to scale out.

FaRM and DrTM exploit advanced hardware to improve transaction performance.

Recovery

Most traditional database systems choose to implement a recovery protocol based on ARIES.

VoltDB uses command logging so that recovery starts from a checkpoint and replays the commands in the log.

NVRAM devices have been used to reduce recovery time.

Amazon Aurora decouples redo log processing from the database engine by leveraging smart storage, and only undo log is processed by the database engine.

7. Conclusions

This paper presents FoundationDB.

It is a key-value store designed for OLTP cloud services.

The main idea is to decouple transaction processing from logging and storage.

It enables the separation and horizontal scaling of both read and write handling.

The transaction system combines OCC and MVCC to ensure strict serializability.

Deterministic and randomized simulation has ensured the correctness of the database implementation.

References

Google Spanner paper post

Google BigTable paper post

CockroachDB paper post

Amazon Aurora paper post

PNUTS: Yahoo!’s Hosted Data Serving Platform

Speedy Transactions in Multicore In-Memory Databases

Fast Crash Recovery in RAMCloud

Solar: Towards a Shared-Everything Database on Distributed Log-Structured Storage
