Insights from Paper-TAO: Facebook’s Distributed Data Store for the Social Graph

Hemant Gupta
12 min read · Apr 26, 2023


Introduction

TAO is a geographically distributed data store built by Facebook. It provides efficient and timely access to the social graph.

TAO runs on thousands of machines and provides access to many petabytes of data. It can process a billion reads and millions of writes per second.

Facebook has more than a billion active users who record their relationships, share their interests, and upload text, images, and videos. This data is called the social graph. In this paper, we will see how TAO is designed and implemented as a read-optimized graph store that serves the social graph.

Before TAO, Facebook used memcache as a look-aside cache, and web servers accessed MySQL directly to read and write the social graph.

There were some fundamental shortcomings in that memcache architecture, so Facebook developed TAO. TAO still uses MySQL for data storage but maintains its own graph-aware cache.

Facebook has deployed a single geographically distributed instance of TAO. TAO has a minimal API and explicitly favors availability and per-machine efficiency over strong consistency. The best part of TAO is its scalability. It can sustain a billion reads per second on a changing data set of many petabytes.

Background

A single Facebook page may aggregate and filter hundreds of items from the social graph. Facebook presents each user with content tailored to them.

It is infeasible to perform most aggregation and filtering when content is created. Instead, Facebook resolves data dependencies and checks privacy each time the content is viewed.

As much as possible, Facebook pulls the social graph instead of pushing. This implementation strategy places extreme read demands on the graph data store. Reads must be efficient, highly available, and scale to high query rates.

Serving the Graph from Memcache

As I mentioned previously, Facebook was using memcache as a look-aside cache. This architecture was a good fit for Facebook’s rapid iteration cycles.

TAO is constructed as a service that directly implements the objects (nodes) and associations (edges) model.

Let’s understand the issues with the previous architecture:

  1. Inefficient edge lists: A key-value cache is not a good fit for fetching large lists of edges, because any change requires reloading the entire list.
  2. Distributed control logic: The control logic runs on clients that do not communicate with each other, which increases the number of failure modes and makes it difficult to avoid thundering herds.
  3. Expensive read-after-write consistency: Facebook uses asynchronous master/slave replication for MySQL, which creates consistency problems for caches in data centers that read from a replica.

TAO’s Goal

TAO is heavily optimized for reads and explicitly favors efficiency and availability over consistency.

TAO Data Model and API

Facebook focuses on people, actions, and relationships. Facebook models these entities and connections as nodes and edges in a graph.

TAO’s goal is not to support a complete set of graph queries, but to provide enough power to handle most application needs with an efficient and scalable implementation.

An example of objects and associations

In the diagram above, the social graph includes the users (Alice, Bob, Cathy, and David), their relationships, their actions (checking in, commenting, and liking), and a physical location (the Golden Gate Bridge).

Objects and Associations

Let’s start with the basics:

  1. TAO objects are typed nodes.
  2. TAO associations are typed directed edges between objects.

Objects are identified by a 64-bit integer (id) that is unique across all objects. Associations are identified by the source object (id1), the association type (atype), and the destination object (id2).

There can be at most one association of a given type between any two objects. Each association has a 32-bit time field.

Both objects and associations may contain data as key-value pairs.

There is a per-type schema that lists the possible keys, the value type, and a default value.

Object: (id) → (otype, (key → value)*)

Assoc.: (id1, atype, id2) → (time, (key → value)*)
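
To make the model concrete, here is a minimal sketch of the two entities in Python. The field names follow the paper’s schema; the class layout and the example type names are my own illustration, not TAO’s actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TaoObject:
    """A typed node: (id) -> (otype, (key -> value)*)."""
    id: int                   # 64-bit id, unique across all objects
    otype: str                # object type, e.g. "USER" or "COMMENT"
    data: Dict[str, str] = field(default_factory=dict)  # typed key-value pairs

@dataclass
class TaoAssociation:
    """A typed directed edge: (id1, atype, id2) -> (time, (key -> value)*)."""
    id1: int                  # source object id
    atype: str                # association type, e.g. "FRIEND" or "LIKES"
    id2: int                  # destination object id
    time: int                 # 32-bit time field, used to order lists
    data: Dict[str, str] = field(default_factory=dict)
```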

Let’s identify the objects and associations in the example diagram.

The users, the check-in, and the comment are objects.

Users’ friendships, authorship of the check-in and comment, and the binding between the check-in and its location and comments are associations.

Actions may be encoded either as objects or associations. Associations naturally model actions that can happen at most once or that record state transitions, such as accepting an event invitation. Repeatable actions, such as comments, are better represented as objects.

Although associations are directed, it is common for an association to be tightly coupled with an inverse edge.

Object API

TAO’s object API provides operations to allocate a new object and id, and to retrieve, update, or delete the object associated with an id.
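
The paper describes these four operations but does not publish exact signatures, so the following in-memory sketch is only illustrative; the method names and the id allocator are my assumptions.

```python
import itertools

class TaoObjectApi:
    """In-memory sketch of TAO's object API (names are assumptions)."""

    def __init__(self):
        self._objects = {}                  # id -> (otype, data)
        self._next_id = itertools.count(1)  # stand-in for TAO's id allocation

    def obj_add(self, otype, data):
        """Allocate a new object and id."""
        oid = next(self._next_id)
        self._objects[oid] = (otype, dict(data))
        return oid

    def obj_get(self, oid):
        """Retrieve the object associated with an id, or None."""
        return self._objects.get(oid)

    def obj_update(self, oid, data):
        """Merge updated key-value pairs into the object's data."""
        otype, old = self._objects[oid]
        self._objects[oid] = (otype, {**old, **data})

    def obj_delete(self, oid):
        """Delete the object associated with an id."""
        self._objects.pop(oid, None)
```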

Association API

Many edges in the social graph are bidirectional. Bidirectional edges are modeled as two separate associations, and TAO provides support for keeping associations in sync with their inverses.

Symmetric bidirectional types are their own inverses.

assoc_add(id1, atype, id2, time, (k→v)*) — Adds or overwrites the association (id1, atype, id2), and its inverse (id2, inv(atype), id1) if defined.
assoc_delete(id1, atype, id2) — Deletes the association (id1, atype, id2) and its inverse if it exists.
assoc_change_type(id1, atype, id2, newtype) — Changes the association (id1, atype, id2) to (id1, newtype, id2), if (id1, atype, id2) exists.
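
Here is a minimal in-memory sketch of how these write operations might keep an inverse edge in sync. The inv() mapping table and the storage layout are assumptions for illustration only.

```python
import time as _time

# Hypothetical inverse-type table; symmetric types map to themselves.
INVERSE = {"LIKES": "LIKED_BY", "LIKED_BY": "LIKES", "FRIEND": "FRIEND"}

class TaoAssocApi:
    def __init__(self):
        self._assocs = {}  # (id1, atype, id2) -> (time, data)

    def assoc_add(self, id1, atype, id2, t=None, data=None):
        """Add or overwrite (id1, atype, id2) and its inverse, if defined."""
        t = int(_time.time()) if t is None else t
        self._assocs[(id1, atype, id2)] = (t, data or {})
        inv = INVERSE.get(atype)
        if inv is not None:
            # In real TAO the inverse may live on a different shard, and
            # the two updates are not atomic (see the caching layer below).
            self._assocs[(id2, inv, id1)] = (t, data or {})

    def assoc_delete(self, id1, atype, id2):
        """Delete the association and its inverse, if present."""
        self._assocs.pop((id1, atype, id2), None)
        inv = INVERSE.get(atype)
        if inv is not None:
            self._assocs.pop((id2, inv, id1), None)
```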

Association Query API

The starting point for any association query is an originating object and an association type.

A characteristic of the social graph is that most of the data is old, but many of the queries are for the newest subset. This creation-time locality arises whenever an application focuses on recent items.

Association queries are organized around association lists. A list contains all associations with a particular id1 and atype, arranged in descending order by the time field, as shown below:

Association List: (id1, atype) → [a_new . . . a_old]

TAO enforces a per-atype upper bound (typically 6,000) on the limit used for an association query. To enumerate the elements of a longer association list, the client must issue multiple queries, using pos or high to specify a starting point, as in the sketch below.
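
For example, the paper’s query API includes assoc_range(id1, atype, pos, limit), which returns elements of the list starting at offset pos. Here is a hedged sketch of paging through a long list, assuming a client object that exposes that call:

```python
def enumerate_association_list(client, id1, atype, page_size=1000):
    """Page through (id1, atype) newest-first using repeated queries.

    `client.assoc_range` is assumed to mirror the paper's signature:
    assoc_range(id1, atype, pos, limit). The server caps `limit` at the
    per-atype bound (typically 6,000), so long lists need many calls.
    """
    pos = 0
    while True:
        page = client.assoc_range(id1, atype, pos, page_size)
        if not page:
            break
        yield from page
        pos += len(page)
```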

TAO Architecture

Storage Layer

Objects and associations are stored in MySQL.

The TAO API is mapped to a small set of simple SQL queries.

It is also possible to map these APIs to range scans in a non-SQL data storage system, provided required indexes are available.

As the data is too large for a single MySQL instance, TAO divides it into logical shards. Each shard is contained in a logical database, and database servers are responsible for one or more shards.

By default, all object types are stored in one table, and all association types in another.

Each object_id contains an embedded shard_id that identifies its hosting shard.

An association is stored on the shard of its id1, so every association query can be served from a single server.
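
The paper says each object id embeds its shard id but does not specify the encoding, so the bit layout below is purely an assumption for illustration; the routing rule for associations, however, matches the text above.

```python
SHARD_BITS = 16  # assumed width; the real encoding is not published

def shard_of_object(object_id: int) -> int:
    # Assumption: the shard id occupies the low bits of the 64-bit id.
    return object_id & ((1 << SHARD_BITS) - 1)

def shard_of_association(id1: int, atype: str, id2: int) -> int:
    # An association lives on the shard of its id1, so an entire
    # association list (id1, atype) is served by a single server.
    return shard_of_object(id1)
```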

Caching Layer

TAO’s cache implements the complete API for clients.

The caching layer consists of multiple cache servers that together form a tier.

A tier is collectively capable of responding to any TAO request. Each request maps to a single cache server using a sharding scheme.

Clients issue requests directly to the appropriate cache server, responsible for completing the read or write.

The TAO in-memory cache contains objects, association lists, and association counts. The cache is filled on demand, and items are evicted using a least recently used (LRU) policy.

A write operation on an association with an inverse may involve two shards, since the forward edge is stored on the shard of id1 and the inverse on the shard of id2. The tier member that receives the query from the client issues an RPC call to the member hosting id2. TAO does not provide atomicity between the two updates: if a failure occurs, the forward edge may exist without an inverse. Such hanging associations are scheduled for repair by an asynchronous job, as sketched below.
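
A toy sketch of that two-shard write path; the Shard class and its methods are stand-ins I invented for illustration, not TAO’s actual interfaces.

```python
import queue

repair_queue = queue.Queue()  # drained later by an asynchronous repair job

class Shard:
    """Toy stand-in for a cache server hosting one shard."""
    def __init__(self, reachable=True):
        self.assocs = {}
        self.reachable = reachable

    def apply(self, edge):
        key, value = edge
        self.assocs[key] = value

    def rpc_apply(self, edge):
        if not self.reachable:
            raise ConnectionError("remote shard unreachable")
        self.apply(edge)

def write_assoc_with_inverse(local, remote, fwd_edge, inv_edge):
    """Apply the forward edge locally, then RPC the inverse to the shard
    hosting id2. The two updates are not atomic: on failure, the hanging
    association is queued for asynchronous repair."""
    local.apply(fwd_edge)
    try:
        remote.rpc_apply(inv_edge)
    except ConnectionError:
        repair_queue.put(inv_edge)  # fixed up later, off the write path
```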

Client Communication Stack

It is common for hundreds of objects and associations to be queried while rendering a Facebook page. The latency of TAO requests can be much higher than those of memcache because TAO requests may access the database.

TAO uses a protocol with out-of-order responses to avoid head-of-line blocking on multiplexed connections.

Leaders and Followers

In theory, a single cache tier could be scaled to handle any foreseeable aggregate request rate, as long as shards are small enough. In practice, large tiers are problematic: they are prone to hot spots, and the number of all-to-all connections grows quadratically.

To add servers while limiting the maximum tier size, TAO splits the cache into a leader tier and multiple follower tiers.

Only the leaders read from and write to the storage layer, so each database has a single cache coordinator. This split allows the coordinators to be located in a single tier per region.

One leader hosts each shard, and all writes to a shard go through its leader, which can leave followers temporarily inconsistent. TAO provides eventual consistency by asynchronously sending cache maintenance messages from the leader to the followers.
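
A minimal sketch of that read and write flow, assuming followers are plain caches and the leader is the single writer for its shard; the class and method names here are illustrative, not TAO’s.

```python
class Leader:
    """Toy leader: the single writer for its shard."""
    def __init__(self):
        self.store = {}      # stands in for the leader cache plus database
        self.followers = []  # follower caches, modeled as plain dicts

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        self.store[key] = value
        # Real TAO sends these maintenance messages asynchronously, so
        # followers may serve stale data for a short window.
        for cache in self.followers:
            cache.pop(key, None)  # invalidate; next read refills via leader

def follower_read(cache: dict, leader: Leader, key):
    """Followers serve hits locally and forward misses to the leader."""
    if key in cache:
        return cache[key]
    value = leader.read(key)
    cache[key] = value
    return value
```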

Scaling Geographically

The leader and followers configuration allows TAO to scale to handle a high workload. Read throughput scales with the total number of follower servers in all tiers.

Follower tiers can be thousands of miles apart. In this configuration, network round trip times can quickly become the bottleneck of the overall architecture.

TAO chose a master/slave architecture that requires writes to be sent to the master but allows read misses to be serviced locally. TAO propagates update notifications asynchronously to maximize performance and availability at the expense of data freshness.

The choice of master and slave is made separately for each shard.

TAO addresses this by clustering its data centers into only a few regions, where intra-region latency is small (typically less than 1 millisecond). It is then sufficient to store one complete copy of the social graph per region. See the diagram below for the full architecture.

Implementation

In this section, we will talk about important optimizations for performance and storage efficiency.

Caching Servers

TAO aggressively caches objects and associations to provide good read performance.

TAO’s memory management is based on a slab allocator that manages slabs of equal-size items. A slab item can hold one node or one edge list.

The cache itself is a thread-safe hash table. Eviction follows an LRU policy among items of equal size, and a dynamic slab rebalancer keeps the LRU eviction ages similar across all slabs.

TAO partitions the available RAM into arenas. This allows TAO to extend the cache lifetime of important types.

For small fixed-size items, such as association counts, TAO stores these items separately, using direct-mapped 8-way associative caches that require no pointers.

MySQL Mapping

Each shard is assigned to a logical MySQL database that has a table for objects and a table for associations.

All of the fields of an object are serialized into a single ‘data’ column. Objects that benefit from separate data management policies are stored in separate custom tables.

Associations are stored similarly to objects but to support range queries, their tables have an additional index based on id1, atype, and time.

To avoid potentially expensive SELECT COUNT queries, the association counts are stored in a separate table.
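
The paper does not publish its actual schemas, so the DDL below is only a guess that is consistent with the description: one serialized data column for objects, an (id1, atype, time) index for range queries, and a separate counts table.

```python
# Hypothetical MySQL DDL; the real TAO schemas are not published.
CREATE_OBJECTS = """
CREATE TABLE objects (
  id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  otype INT NOT NULL,
  data  MEDIUMTEXT        -- all object fields serialized into one column
);
"""

CREATE_ASSOCS = """
CREATE TABLE assocs (
  id1   BIGINT UNSIGNED NOT NULL,
  atype INT NOT NULL,
  id2   BIGINT UNSIGNED NOT NULL,
  time  INT UNSIGNED NOT NULL,
  data  MEDIUMTEXT,
  PRIMARY KEY (id1, atype, id2),
  KEY by_time (id1, atype, time)  -- supports newest-first range queries
);
"""

CREATE_COUNTS = """
CREATE TABLE assoc_counts (
  id1   BIGINT UNSIGNED NOT NULL,
  atype INT NOT NULL,
  cnt   BIGINT UNSIGNED NOT NULL, -- avoids expensive SELECT COUNT queries
  PRIMARY KEY (id1, atype)
);
"""
```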

Cache Sharding and Hot Spots

In a tier, shards are mapped onto cache servers using consistent hashing. TAO rebalances load among followers with shard cloning, in which reads to a shard are served by multiple followers in a tier.

Consistency management messages for a cloned shard are sent to all followers hosting that shard.

When a follower responds to a query for a hot item, it includes the object or association’s access rate. If the access rate exceeds a certain threshold, the TAO client caches the data and version, and the follower can omit the data in subsequent replies if it has not changed since that version.
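
A minimal sketch of that version check on the follower side; the reply shape and store layout are assumptions.

```python
def follower_reply(store, key, client_version=None):
    """If the client's cached version is current, omit the data.

    `store` maps key -> (data, version); both names are hypothetical.
    """
    data, version = store[key]
    if client_version == version:
        return {"version": version}  # data omitted: client copy is fresh
    return {"version": version, "data": data}

# Usage: a client that already holds version 7 sends it with the query.
store = {("alice", "FRIEND"): (["bob", "cathy"], 7)}
assert follower_reply(store, ("alice", "FRIEND"), client_version=7) == {"version": 7}
```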

Consistency and Fault Tolerance

TAO’s two most important requirements are availability and performance.

In case of failure, it is necessary to continue rendering Facebook pages, even if the data is stale.

Consistency

Objects and associations in TAO are eventually consistent. Replication lag is usually less than one second.

TAO provides read-after-write consistency within a single tier for at most one failure.

When a write is successful, the master leader returns a changeset. This changeset is propagated through the slave leader (if present) to the follower from which the write originated.

Failure Detection and Handling

Because TAO scales to thousands of machines across multiple geographic locations, it must detect potential failures and route around them.

TAO servers employ aggressive network timeouts to avoid waiting on responses that may never arrive.

Each TAO server maintains per-destination timeouts, marking hosts as down after several consecutive timeouts. This simple failure detector works well in practice; a sketch follows.
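
A toy version of such a per-destination detector; the threshold and hold-down period below are invented for illustration.

```python
import time

TIMEOUT_LIMIT = 3      # assumed: consecutive timeouts before marking down
HOLD_DOWN_SECS = 10.0  # assumed: how long to route around a down host

class FailureDetector:
    """Per-destination failure detector: several consecutive timeouts
    mark a host down, and traffic is routed around it for a while."""

    def __init__(self):
        self._timeouts = {}    # host -> consecutive timeout count
        self._down_until = {}  # host -> time at which it may be retried

    def record_success(self, host):
        self._timeouts[host] = 0

    def record_timeout(self, host):
        count = self._timeouts.get(host, 0) + 1
        self._timeouts[host] = count
        if count >= TIMEOUT_LIMIT:
            self._down_until[host] = time.time() + HOLD_DOWN_SECS

    def is_up(self, host):
        return time.time() >= self._down_until.get(host, 0.0)
```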

Database failures: Databases are marked down in a global configuration if they crash, if they are taken offline for maintenance, or if they are replicating from a master database and they get too far behind.

When a master database is down, one of its slaves is automatically promoted to be the new master.

Leader failures: When a leader cache server fails, followers automatically route read and write requests around it.

Refill and invalidation failures: Leaders send refills and invalidations asynchronously. If a follower is unreachable, the leader queues the message to disk to be delivered later.

Follower failures: In the event that a TAO follower fails, followers in other tiers share the responsibility of serving the failed host’s shards.

Performance

Availability: Over 90 days, the fraction of failed TAO queries as measured from the web server was 4.9 × 10⁻⁶.

Follower capacity: The peak throughput of a follower depends on its hit rate. See the diagram below for production data.

Throughput of an individual follower

Hit rates and latency: TAO’s overall hit rate for reads was 96.4%. The diagram below shows the client-observed latencies for reads.

Client-observed TAO latency in milliseconds for read requests

TAO’s writes are performed synchronously to the master database. The diagram below compares write latency in two data centers that are 58.1 milliseconds apart (average round trip). The average write latency in the same region as the master was 12.1 ms; in the remote region, it was 74.4 ms (58.1 ms round trip + 16.3 ms).

Write latency from clients.

Replication lag: TAO’s slave storage servers lag their master by less than 1 second during 85% of the tracing window, by less than 3 seconds 99% of the time, and by less than 10 seconds 99.8% of the time.

Failover: Follower caches directly contact the database when a leader is unavailable. This failover path was used on 0.15% of follower cache misses. Failover for write requests involves delegating those requests to a random leader, which occurred for 0.045% of association and object writes. Slave databases were promoted to be the master 0.25% of the time due to planned maintenance or unplanned downtime.

Related Work

Terry et al. describe eventual consistency, the relaxed consistency model used by TAO.

Werner Vogels describes read-after-write consistency as a property of some variants of eventual consistency.

The Coda file system uses data replication to improve performance and availability in the face of slow or unreliable networks.

Megastore is a storage system that uses Paxos across geographically distributed data centers to provide strong consistency guarantees and high availability.

Spanner, the next-generation globally distributed database developed at Google after Megastore, introduces the concept of a time API that exposes time uncertainty and leverages that to improve commit throughput and provide snapshot isolation for reads.

Amazon’s Dynamo demonstrates how distributed hash tables can be used in building flexible and robust commercial systems.

LinkedIn’s Voldemort also implements a distributed key-value store but for a social network.

Dabek et al. focus on designing DHTs for wide-area networks.

Gribble et al. provide a coherent view of cached data by leveraging a two-phase commit. Glendenning et al. built a linearizable key-value store tolerant of churn.

The COPS system provides causal consistency in a highly available key-value store by tracking all dependencies for all keys accessed by a client context. Eiger improves on COPS by tracking conflicts between pending operations in a column-family database.

TAO follows the recent trend of shifting away from relational databases towards structured storage approaches. This more scalable approach includes Google’s BigTable, Yahoo!’s PNUTS, Amazon’s SimpleDB, and Apache’s HBase.

Shao and Wang’s Trinity stores its graph structures in memory.

Neo4j is a popular open-source graph database that provides ACID semantics and the ability to shard data across several machines.

Twitter uses its FlockDB to store parts of its social graph.

PEGASUS and Yahoo’s Pig Latin are systems for data mining and graph analysis on top of Hadoop. Google’s Pregel tackles many of the same graph analysis issues but uses its own more expressive job distribution model.

Facebook has similar large-scale offline graph-processing systems that operate on data copied from TAO’s databases, but these analysis jobs do not execute within TAO itself.

Conclusion

This paper highlights three things:

  1. First, it characterizes a challenging Facebook workload: queries that require high-throughput, low-latency read access to the large, changing social graph.
  2. Second, it describes the objects and associations data model for Facebook’s social graph and the API that serves it.
  3. Lastly, it details TAO, Facebook’s geographically distributed system that implements this API.

TAO’s separation of the cache and the persistent store has allowed those layers to be independently designed, scaled, and operated, maximizing the reuse of components across the organization.

TAO chooses different tradeoffs for efficiency and consistency at the two layers and uses an idempotent cache invalidation strategy. TAO’s restricted data and consistency model allows an efficient and highly available implementation.

References

Amazon SimpleDB: http://aws.amazon.com/simpledb/.

LevelDB: https://code.google.com/p/leveldb

Voldemort: http://project-voldemort.com/

Bigtable: https://research.google.com/archive/bigtable-osdi06.pdf

PNUTS: https://sites.cs.ucsb.edu/~agrawal/fall2009/PNUTS.pdf

Spanner: https://research.google.com/archive/spanner-osdi2012.pdf

Dabek: https://pdos.csail.mit.edu/papers/dhash:nsdi/paper.pdf

FlockDB: http://engineering.twitter.com/2010/05/introducingflockdb.html

Glendenning: https://homes.cs.washington.edu/~tom/pubs/scatter.pdf

COPS: https://www.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf

Neo4j: http://neo4j.org/

Scaling Memcache at Facebook: https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf

Redis: http://redis.io/

Trinity: http://research.microsoft.com/en-us/projects/trinity/

HBase: http://hbase.apache.org

Pig Latin: https://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/sigmod08/p1099-olston.pdf

Gribble: https://people.eecs.berkeley.edu/~culler/papers/dds.pdf

PEGASUS: https://www.cs.cmu.edu/~ukang/papers/PegasusKAIS.pdf

Terry: https://people.cs.umass.edu/~mcorner/courses/691M/papers/terry.pdf

Coda: https://www.cs.cmu.edu/~satya/docdir/satya-tocs-codaevol-2002.pdf
