Insights from paper: Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

Hemant Gupta
12 min read · Aug 26, 2024


1. Abstract

Amazon Aurora is a relational database in Amazon Web Services (AWS). It is designed for OLTP workloads.

This paper describes the design considerations and the resulting architecture of Aurora.

The team believes the central constraint in high throughput data processing has moved from compute and storage to the network.

Aurora solves this constraint by pushing redo log processing to a multi-tenant scale-out storage service.

The paper explains the benefits, such as reduced network traffic, fast crash recovery, failovers to replicas without data loss, fault-tolerant storage, and self-healing storage.

2. Introduction

Previously, compute and storage were the bottlenecks for database systems.

The cloud has made it possible to decouple compute from storage.

The bottleneck has now moved to the network between the database tier requesting I/Os and the storage tier performing these I/Os.

The primary bottlenecks are packets per second (PPS) and bandwidth. Traffic is further amplified because a performant database issues writes to the storage fleet in parallel.

Synchronous operations, like disk reads caused by a miss in the database buffer cache, can result in stalls and context switches. Transaction commits are another source of interference. A stall in committing one transaction can inhibit others from progressing.

Handling commits with multi-phase synchronization protocols such as two-phase commit (2PC) is challenging in a cloud-scale distributed system.

Aurora resolves these issues by aggressively leveraging the redo log across a highly distributed cloud environment.

The diagram below shows a service-oriented architecture with a multi-tenant scale-out storage service that abstracts a virtualized segmented redo log.

Move logging and storage off the database engine

The above architecture has three significant advantages over traditional approaches.

  1. The storage is an independent fault-tolerant and self-healing service across multiple data centers.
  2. Only redo log records are written to storage. It reduces network IOPS.
  3. Some of the most complex and critical functions are moved from one-time expensive operations in the database engine to continuous asynchronous operations amortized across a large distributed fleet.

The paper describes three contributions:

1. How to reason about durability at cloud scale and design quorum systems resilient to correlated failures.

2. How to leverage smart storage by offloading the lower quarter of a traditional database to this tier.

3. How to eliminate multi-phase synchronization, crash recovery, and checkpointing in distributed storage.

3. Durability At Scale

3.1 Replication and Correlated Failures

The storage nodes and disks can fail, so they must be replicated in some form to provide resiliency.

One approach to tolerate failures in a replicated system is to use a quorum-based voting protocol.

If each of the V copies of a replicated data item is assigned a vote, a read or write operation must obtain a read quorum of Vr votes or a write quorum of Vw votes respectively.

To achieve consistency, the quorums must obey two rules.

  1. Each read must know the most recent write, formulated as Vr + Vw > V.
  2. Each write must know the most recent write to avoid conflicting writes, formulated as Vw > V/2.

The team believes 2/3 quorums are inadequate at cloud scale. Aurora instead replicates each data item 6 ways across 3 AZs, with 2 copies of each item in each AZ, and uses a quorum of 6 votes (V = 6) with a write quorum of 4/6 (Vw = 4) and a read quorum of 3/6 (Vr = 3). The six copies of an item together form a Protection Group (PG).
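As a quick illustration (a sketch, not anything from Aurora's codebase), the two quorum rules and this 6/4/3 configuration can be checked in a few lines of Python:

```python
def is_consistent(v: int, v_w: int, v_r: int) -> bool:
    """Check the two quorum rules: reads see the most recent write
    (Vr + Vw > V) and writes cannot conflict (Vw > V/2)."""
    return (v_r + v_w > v) and (2 * v_w > v)

# Traditional 3-way replication with 2/3 read and write quorums.
assert is_consistent(v=3, v_w=2, v_r=2)

# Aurora: 6 copies across 3 AZs, write quorum 4/6, read quorum 3/6.
assert is_consistent(v=6, v_w=4, v_r=3)
```

The 3-copy scheme also passes the rules; the point of 6/4/3 is fault tolerance. Losing an entire AZ (two copies) still leaves four copies, preserving write quorum, and losing an AZ plus one more node still leaves three copies, preserving read quorum and hence the data.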

The database volume is partitioned into small, fixed-size segments (10 GB each at the time of the paper), and each PG consists of six such segments. Segments are the unit of independent background-noise failure and repair.

This design tolerates the loss of an entire AZ plus one additional node (AZ+1) without losing data, and the loss of an entire AZ without losing the ability to write. A storage system that can handle the long-term loss of an AZ can also handle a brief outage caused by a power event or a bad software deployment that requires rollback.

4. The Log Is The Database

Aurora offloads log processing to the storage service.

4.1 The Burden of Amplified Writes

A traditional database system like MySQL writes data pages to objects (heap files, b-trees, etc.) and redo log records to a write-ahead log (WAL).

Each redo log record consists of the difference between the after-image and the before-image of the page.

Depending on the configuration, other data must also be written, such as the binary log, the double-write buffer, and the metadata (FRM) files.

For example, the writes made to the primary EBS volume are synchronized with the standby EBS volume using software mirroring, as shown in the diagram below:

Network IO in mirrored MySQL

The above model is undesirable not only because of how data is written but also because of what data is written.

4.2 Offloading Redo Processing to Storage

When a traditional database modifies a data page, it generates a redo log record and invokes a log applicator that applies the redo log record to the page's in-memory before-image to produce its after-image.

The transaction commit requires the log to be written immediately, while the data page writing may be deferred.

Only redo log records cross the network in Aurora.

The log applicator is pushed to the storage tier, where it can generate database pages in the background or on demand.

Aurora continually materializes database pages in the background to avoid regenerating them from scratch on demand every time. This approach dramatically reduces network load despite amplifying writes for replication, and it improves both performance and durability.
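A minimal sketch of what a log applicator does, with simplified, hypothetical page and record types: it folds redo records, in LSN order, onto a page's before-image to produce the after-image. In Aurora this work runs in the storage tier rather than in the database engine.

```python
from dataclasses import dataclass

@dataclass
class RedoRecord:
    lsn: int
    offset: int          # byte offset within the page
    after_bytes: bytes   # simplified "diff": the new bytes for that region

def apply_redo(before_image: bytearray, records: list[RedoRecord]) -> bytearray:
    """Fold redo records (in LSN order) onto a page's before-image
    to materialize its after-image."""
    page = bytearray(before_image)
    for rec in sorted(records, key=lambda r: r.lsn):
        page[rec.offset:rec.offset + len(rec.after_bytes)] = rec.after_bytes
    return page
```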

The storage service can scale out I/Os in an embarrassingly parallel fashion without impacting the write throughput of the database engine.

The diagram below shows an Aurora cluster with one primary instance and multiple replica instances deployed across multiple AZs.

To measure network I/O, the team ran a test using the SysBench write-only workload with a 100GB data set for two configurations: one with a synchronous mirrored MySQL configuration across multiple AZs and the other with Amazon Aurora replicated across multiple AZs.

The table above summarizes the results: over the 30-minute test, Aurora completed 35x more transactions than mirrored MySQL while issuing 7.7x fewer I/Os per transaction.

After a crash, the system must start from the most recent checkpoint and replay the log to ensure that all persistent redo records have been applied.

The durable redo record application happens at the storage tier, continuously, asynchronously, and distributed across the fleet.

Any read request for a data page may require some redo records to be applied if the page is not current. As a result, the crash recovery process is spread across all normal foreground processing.

4.3 Storage Service Design Points

The main goal for the storage service was to minimize the latency of the foreground write request.

The diagram above shows various activities on the storage node.

  1. Receive the log record and add it to an in-memory queue.
  2. Persist record on disk and acknowledge.
  3. Organize records and identify gaps in the log since some batches may be lost.
  4. Gossip with peers to fill in gaps.
  5. Coalesce log records into new data pages.
  6. Periodically stage log and new pages to S3.
  7. Periodically, garbage collect old versions.
  8. Finally, periodically validate CRC codes on pages.
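Steps 3 and 4 can be sketched as follows; the peer interface (`peer.get_record`) and record shapes are assumptions made purely to show the idea of detecting missing LSNs and pulling them from peers in the same PG.

```python
def find_gaps(received_lsns: set[int], low: int, high: int) -> list[int]:
    """Identify LSNs in [low, high] that this node has not yet received."""
    return [lsn for lsn in range(low, high + 1) if lsn not in received_lsns]

def fill_gaps(received: dict[int, bytes], low: int, high: int, peers) -> None:
    """Gossip with peers in the same protection group to fetch missing records."""
    for lsn in find_gaps(set(received), low, high):
        for peer in peers:
            record = peer.get_record(lsn)  # hypothetical peer API
            if record is not None:
                received[lsn] = record
                break
```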

5. The Log Marches Forward

Let’s understand how the database engine generates the log so that the durable, runtime, and replica states are always consistent.

Aurora implements consistency efficiently without an expensive 2PC protocol.

5.1 Solution sketch: Asynchronous Processing

The log advances as an ordered sequence of changes. Each log record has an associated Log Sequence Number (LSN), a monotonically increasing value generated by the database.

At a high level, Aurora maintains points of consistency and durability and continually advances them as it receives acknowledgments for outstanding storage requests.

The storage node gossips with the other members of its PG to fill any gaps.

The runtime state maintained by the database lets it use single-segment reads rather than quorum reads. Quorum reads are needed only during recovery, when that state has been lost.

The database may have multiple outstanding isolated transactions. If the database crashes or reboots, the decision of whether to roll back is made separately for each of them.

The logic for tracking and undoing partially completed transactions is kept in the database engine.

The storage service determines the highest LSN for which it can guarantee the availability of all prior log records. This LSN is called the Volume Complete LSN, or VCL for short.

During storage recovery, every log record with an LSN larger than the VCL must be truncated. The database can further constrain the set of allowable truncation points by tagging certain log records as Consistency Point LSNs (CPLs).

Aurora defines VDL (Volume Durable LSN) as the highest CPL smaller than or equal to VCL. It truncates all log records with LSN greater than the VDL.

Let’s take an example to understand it properly. Aurora may have the complete data up to LSN 1007. It may have declared that only 900, 1000, and 1100 are CPLs. This means Aurora must truncate at 1000.

So Aurora is complete to 1007 but durable only to 1000; completeness and durability are therefore different notions.

If the client does not need such distinctions, it can simply mark every log record as a CPL.
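A tiny sketch (illustrative only, not Aurora's data structures) shows how the VDL in the example above falls out of the VCL and the declared CPLs:

```python
def volume_durable_lsn(vcl: int, cpls: list[int]) -> int:
    """VDL is the highest CPL that is <= VCL; records above it are truncated."""
    eligible = [cpl for cpl in cpls if cpl <= vcl]
    return max(eligible) if eligible else 0

# Complete up to LSN 1007, with CPLs declared at 900, 1000, and 1100.
assert volume_durable_lsn(1007, [900, 1000, 1100]) == 1000
```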

The database and storage interactions are as follows:

  1. Each database-level transaction is broken up into multiple mini-transactions (MTRs) that are ordered and must be performed atomically.
  2. Each mini-transaction is composed of multiple contiguous log records.
  3. The final log record in a mini-transaction is a CPL.

5.2 Normal Operation

Writes

The database continuously interacts with the storage service and maintains state to establish a quorum, advance volume durability, and register transactions as committed.

There can be a large number of concurrent transactions active in the database.

The database allocates a unique ordered LSN for each log record, subject to the constraint that no LSN is allocated with a value greater than the sum of the current VDL and a constant called the LSN Allocation Limit (LAL) (currently set to 10 million).

This limit ensures that the database does not get too far ahead of the storage system and introduces back pressure that can throttle the incoming writes if the storage or network cannot keep up.
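A hedged sketch of this back-pressure rule, with hypothetical names (only the LAL value of 10 million comes from the paper): the engine refuses to allocate an LSN more than LAL ahead of the current VDL, so writers stall until the VDL advances.

```python
LSN_ALLOCATION_LIMIT = 10_000_000  # LAL from the paper

class LsnAllocator:
    def __init__(self, vdl: int):
        self.vdl = vdl
        self.next_lsn = vdl + 1

    def allocate(self) -> int | None:
        """Return the next LSN, or None if the writer must wait
        for the VDL to advance (back pressure)."""
        if self.next_lsn > self.vdl + LSN_ALLOCATION_LIMIT:
            return None  # storage/network is behind; throttle incoming writes
        lsn, self.next_lsn = self.next_lsn, self.next_lsn + 1
        return lsn

    def on_vdl_advanced(self, new_vdl: int) -> None:
        self.vdl = max(self.vdl, new_vdl)
```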

Commits

Transaction commits are completed asynchronously.

When a client commits a transaction, the thread handling the commit request sets the transaction aside by recording its commit LSN as part of a separate list of transactions waiting on commit.

As the VDL advances, the database identifies qualifying transactions that are waiting to be committed and uses a dedicated thread to send commit acknowledgments to waiting clients.
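A sketch of this asynchronous commit path under assumed data structures: worker threads park transactions with their commit LSNs, and a dedicated driver acknowledges every waiter whose commit LSN is at or below the advancing VDL.

```python
import heapq
import itertools

class CommitQueue:
    def __init__(self):
        self._waiting = []             # min-heap of (commit_lsn, seq, client)
        self._seq = itertools.count()  # tie-breaker so clients are never compared

    def register(self, commit_lsn: int, client) -> None:
        """Worker thread: park the transaction instead of blocking on commit."""
        heapq.heappush(self._waiting, (commit_lsn, next(self._seq), client))

    def on_vdl_advanced(self, vdl: int) -> None:
        """Dedicated thread: ack every commit whose LSN is now durable."""
        while self._waiting and self._waiting[0][0] <= vdl:
            _, _, client = heapq.heappop(self._waiting)
            client.send_commit_ack()  # hypothetical client handle
```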

Reads

Pages are normally served from the buffer cache; a storage I/O request is made only when the requested page is not in the cache.

The system finds a victim page to evict if the buffer cache is full.

Aurora does not write out dirty pages on eviction; instead, it enforces the guarantee that a page in the buffer cache is always the latest version.

The database does not need to establish consensus using a read quorum under normal circumstances.

When reading a page from disk, the database establishes a read-point, which is the VDL at the time the request was issued.

The database can then select any storage node that is complete with respect to the read-point.
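A sketch of how such a single-segment read might be chosen (the node list and the `complete_lsn` attribute are assumptions, not Aurora's API): any node in the PG whose durable prefix covers the read-point can serve the page.

```python
def choose_read_node(segment_nodes, read_point: int):
    """Pick any node in the protection group whose durable log prefix
    covers the read-point (the VDL when the read was issued)."""
    for node in segment_nodes:
        if node.complete_lsn >= read_point:  # hypothetical attribute
            return node
    return None  # fall back to a quorum read (only needed during recovery)
```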

Replicas

A single writer and up to 15 read replicas can all mount a shared storage volume.

The log stream generated by the writer is sent to the storage nodes and all read replicas, reducing the lag.

Replicas consume log records asynchronously from the perspective of the writer.

The replica obeys the following two important rules while applying log records:

  1. The only log records that will be applied are those whose LSN is less than or equal to the VDL.
  2. The log records of a single mini-transaction are applied atomically in the replica’s cache to ensure that the replica sees a consistent view of all database objects.
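These two rules can be expressed as a small filter over incoming mini-transactions; the record shape here is hypothetical, and, per the paper, records for pages not present in the replica's cache are simply discarded.

```python
def apply_to_replica_cache(mtrs, vdl: int, cache: dict) -> None:
    """Apply mini-transactions to a read replica's buffer cache.
    Each `mtr` is assumed to be a list of (lsn, page_id, change) tuples."""
    for mtr in mtrs:
        # Rule 1: only apply records that are already durable (LSN <= VDL).
        if any(lsn > vdl for lsn, _, _ in mtr):
            continue
        # Rule 2: apply the whole mini-transaction atomically.
        for _, page_id, change in mtr:
            if page_id in cache:  # records for uncached pages are discarded
                cache[page_id] = change  # simplified: a real replica applies the redo diff
```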

5.3 Recovery

Most traditional databases use a recovery protocol such as ARIES.

The protocol depends on a write-ahead log (WAL) with precise contents of all committed transactions.

These systems also periodically checkpoint the database to establish points of durability.

On crash recovery, the system processes the redo log records from the last checkpoint by applying each log record to the relevant database page using the log applicator.

This process brings the database pages to a consistent state at the point of failure.

After that, the in-flight transactions during the crash can be rolled back by executing the relevant undo log records.

6. Putting It All Together

The diagram below shows the high-level view of the components.

The database engine is a fork of the community version of MySQL/InnoDB. The main difference is how InnoDB reads and writes data to disk.

In Aurora InnoDB, the redo log records (representing the changes that must be executed atomically in each MTR) are organized into batches.

These log records are sharded by the PG that each record belongs to, and the batches are written to the storage service.

The final log record of each MTR is tagged as a consistency point.
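A rough sketch of this sharding step, with an assumed record shape and page-to-PG mapping:

```python
from collections import defaultdict

def shard_by_protection_group(log_records, pg_of_page) -> dict[int, list]:
    """Group a batch of redo records by the protection group that owns the
    page each record modifies; each batch is then written to that PG's
    storage nodes. `pg_of_page` is an assumed page_id -> PG mapping."""
    batches = defaultdict(list)
    for rec in log_records:
        batches[pg_of_page(rec.page_id)].append(rec)
    return batches
```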

Aurora read replicas get continuous information about transaction starts and commits from the writer and use it to support snapshot isolation for local, read-only transactions.

The concurrency control is implemented entirely in the database engine without impacting the storage service.

Aurora leverages Amazon Relational Database Service (RDS) for its control plane.

The storage service is deployed on a cluster of EC2 VMs provisioned across at least 3 AZs in each region.

The storage nodes manipulate local SSDs and interact with database engine instances, other peer storage nodes, and the backup/restore services, which continuously back up changed data to S3 and restore it from S3 as needed.

The storage control plane uses the Amazon DynamoDB database service for persistent storage of cluster and storage volume configuration, volume metadata, and a detailed description of data backed up to S3.

7. Lessons Learned

7.1 Multi-tenancy and database consolidation

Many Aurora customers operate Software-as-a-Service (SaaS) businesses. These customers often rely on an application they cannot easily change.

They typically consolidate their own customers onto a single instance. For instance, some of Aurora's SaaS customers have more than 50,000 customers of their own.

Because of this model, production instances containing more than 150,000 tables, spread across many small per-tenant databases, are quite common.

Aurora supports the following customer needs:

  1. Sustaining a high level of throughput with many concurrent user connections.
  2. Provisioning and charging for data only as it is used.
  3. Reducing jitter so that spikes from a single tenant have minimal impact on other tenants.

7.2 Highly concurrent auto-scaling workloads

Nowadays, workloads often need to deal with spikes in traffic based on sudden, unexpected events. To support such spikes, it is important for a database to handle many concurrent connections.

This approach is feasible in Aurora since the underlying storage system scales so well. Aurora has several customers that run at over 8000 connections per second.

7.3 Schema evolution

Modern web application frameworks allow developers to make many schema changes to their database.

The team implemented an efficient online DDL approach (a sketch follows this list) that

  1. Versions schemas on a per-page basis and decodes individual pages on demand using their schema history.
  2. Lazily upgrades individual pages to the latest schema using a modify-on-write primitive.
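A toy sketch of these two ideas; the column-list schemas and the version map are hypothetical simplifications, not InnoDB's page format.

```python
# Hypothetical model: each schema version is just a column list, and each
# page remembers the schema version it was last written with.
SCHEMA_HISTORY = {
    1: ["id", "name"],
    2: ["id", "name", "email"],  # a later ALTER TABLE added a column
}

def decode_row(page_schema_version: int, raw_row: dict) -> dict:
    """Decode a row on demand using the schema version stamped on its page;
    columns added by later DDL simply read as None."""
    return {col: raw_row.get(col) for col in SCHEMA_HISTORY[page_schema_version]}

def upgrade_on_write(raw_row: dict, latest_version: int) -> tuple[int, dict]:
    """Modify-on-write: re-encode the row under the latest schema only when
    the page is actually written; untouched pages keep their old version."""
    cols = SCHEMA_HISTORY[latest_version]
    return latest_version, {col: raw_row.get(col) for col in cols}
```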

7.4 Availability and Software Upgrades

The team recently released a Zero-Downtime Patch (ZDP) feature that allows AWS to patch a customer's database engine while in-flight database connections remain unaffected.

ZDP works by looking for an instant where there are no active transactions.

In that instant, ZDP spools the application state to local ephemeral storage, patches the engine, and then reloads the application state.

In the process, user sessions remain active and oblivious that the engine changed under the covers.

8. Related Work

Decoupling storage from compute

Deuteronomy separates a Transaction Component (TC) from a Data Component (DC).

Sinfonia and Hyder abstract transactional access methods over a scale-out service, and database systems can be implemented using these abstractions.

The Yesquel system implements a multi-version distributed balanced tree, separating concurrency control from the query processor.

Distributed Systems

Bailis et al. study the problem of providing Highly Available Transactions (HATs) that neither suffer unavailability during partitions nor incur high network latency.

Aurora provides some of these isolation levels by making the simplifying assumption that, at any given time, a single writer generates log updates with LSNs allocated from a single ordered domain.

Google’s Spanner provides externally consistent reads and writes and globally consistent reads across the database at a timestamp.

Spanner is highly specialized for Google’s read-heavy workload and relies on two-phase commit and two-phase locking for read/write transactions.

Recovery

Hekaton and VoltDB rebuild their in-memory state after a crash using some form of an update log.

Graefe describes a system with per-page log record chains that enables on-demand, page-by-page redo, which can make recovery fast.

9. Conclusion

Aurora is a high throughput OLTP database that compromises neither availability nor durability in a cloud-scale environment.

It decouples storage from compute.

Aurora moved the lower quarter of the database kernel to an independent scalable and distributed service to manage logging and storage.

Aurora relies on quorum models that can handle complex and correlated failures.

References

High-Performance Transactions in Deuteronomy

Sinfonia: A new paradigm for building scalable distributed systems

The VoltDB Main Memory DBMS

Hyder — A Transactional Record Manager for Shared Flash

Yesquel: scalable SQL storage for Web applications

Graefe: Instant recovery for data center savings

Bailis: Eventual consistency today: limitations, extensions, and beyond

Hekaton: High-Performance Concurrency control mechanisms for main-memory databases


Hemant Gupta

https://www.linkedin.com/in/hkgupta/ Working on the challenge to create insights for 100 software eng papers. #100PapersChallenge, AVP Engineering at CoinDCX