Insights from paper — Frangipani — A Scalable Distributed File System
1. Abstract
Frangipani is a file system that provides coherent shared access to files, good performance, and high availability.
It is built using two layers.
The lower layer is Petal. Petal is a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disk structures.
The upper layer is the Frangipani file system code on top of a shared Petal virtual disk. It uses a distributed lock service for cache coherence. We will discuss this in detail.
Frangipani is meant to run in a cluster of machines under a common administration that can communicate securely. The machines should trust one another.
2. Introduction
File system administration is typically laborious for a large, growing user community. More disks must be attached to more machines to serve more users, which requires human administration.
Joining multiple disk drives into one unit using RAID technology is only a partial solution. As users grow, we need multiple RAIDs and multiple server machines.
Frangipani is a scalable distributed file system. It manages a collection of disks on multiple machines as a single shared pool. It has a straightforward internal structure. Its set of cooperating machines uses a common store, and access to the store is synchronized with locks.
Frangipani offers a nice set of features:
- All users are given a consistent view of the same set of files.
- More servers can be added easily.
- A system administrator can add new users without any concern for machines or disks.
- A system administrator can make a complete and consistent backup of the entire file system without bringing the system down.
- The file system tolerates and recovers from machine, network, and disk failures.
Let’s understand the lower layer of Petal first.
Petal is an easy-to-administer distributed storage system that provides virtual disks to its clients. A Petal virtual disk provides storage that can be read or written in blocks.
The virtual disk provides a sparse 2⁶⁴ byte address space. The physical storage is allocated only on demand. Petal can replicate data for high availability if configured. It offers efficient snapshots.
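To make the block-level abstraction concrete, here is a minimal sketch in Go of what a Petal-like virtual disk interface could look like. The interface name and method signatures are illustrative assumptions, not Petal's actual RPC interface.

```go
package petal

// VirtualDisk sketches a client-side view of a Petal-like virtual disk.
// Petal exposes a sparse 2^64-byte address space; physical storage is
// allocated only for ranges that have actually been written.
type VirtualDisk interface {
	// Read fills buf with data starting at the 64-bit virtual offset.
	// Reading a never-written range returns zeros, since the disk is sparse.
	Read(offset uint64, buf []byte) error

	// Write stores buf at the given virtual offset, allocating backing
	// physical space on demand (in large chunks, e.g. 64 KB).
	Write(offset uint64, buf []byte) error

	// Snapshot creates a read-only, crash-consistent copy of the disk
	// using copy-on-write, suitable for online backup.
	Snapshot(name string) error
}
```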
Now, the question is, what does the upper layer of Frangipani do?
Frangipani inherits much of Petal’s scalability, fault tolerance, and ease of administration. It extends these properties to the file system level while Petal works on the disk block level.
Figure: Frangipani layering
There are multiple Frangipani servers. They provide access to the shared files using a Petal virtual disk and coordinate their actions with locks so that caches stay coherent.
There is one critical point to understand: all three components (Frangipani servers, the Petal service, and the lock service) can be scaled independently. This provides excellent fault tolerance and load balancing.
3. System Structure
Before we proceed to the system structure and its components, let's examine a typical assignment of functions to machines.
Figure: Frangipani structure
The top machines contain user programs and the Frangipani file server module. These machines can be diskless.
The bottom ones run Petal and the distributed lock service.
The Petal and Frangipani servers need not be on separate machines. Similarly, the lock service can run on the Frangipani hosts or other machines.
3.1 Components
Let's start with how user programs access Frangipani.
User programs access Frangipani through the typical operating system call interface. Using Frangipani feels like working with a local file system.
The changes made to a file or directory on one machine are visible immediately to all machines. This is the crux of the whole system in one line.
The file content data is staged through the local kernel buffer pool, as in a conventional Unix file system. Thus, changes are not guaranteed to reach stable storage until the next applicable fsync or sync system call.
The metadata changes are logged and can optionally be guaranteed to be stored in stable storage by the time the system call returns.
Frangipani does one small extra thing: it maintains a file's last-accessed time only approximately, to avoid writing metadata on every data read.
The Frangipani file server module registers itself with the kernel’s file system switch as one of the available file system implementations.
The module reads and writes Petal virtual disks using the local Petal device driver.
A Frangipani server does one crucial thing here. It keeps its own redo log of pending changes in a distinct section of the Petal disk. If a server crashes, another server can access these logs and run recovery.
By now, it should be clear that Frangipani servers do not communicate directly with one another.
The Petal device driver manages Petal’s distributed nature. It is responsible for contacting the correct Petal server and failing over if necessary. Petal can tolerate one or more disk or server failures.
Let’s talk about the lock service now.
The lock service is a general-purpose lock service that provides multiple-reader/single-writer locks to clients. Its implementation is distributed, so it provides fault tolerance and scalable performance.
Frangipani uses this lock service to coordinate access to the Petal virtual disk. The lock service’s most important function is keeping the buffer caches coherent across multiple servers.
3.2 Security and the Client/Server Configuration
Security concerns arise if a machine hosts both the user programs and the file server module. Since a Frangipani server can read or write any block on the virtual disk, it should run only on machines with trusted operating systems.
Also, to keep file data private, users must not be able to eavesdrop on the network connecting the Petal service and the Frangipani machines.
One simple solution is to:
- Place the machines in an environment where users cannot boot modified operating system kernels on them.
- Interconnect the machines into a private network to which user processes are not granted access.
Another possible solution is distinguishing between Frangipani client and server machines, as shown in the diagram below.
Figure: Client/server configuration
Here, remote, untrusted clients talk to the Frangipani servers through a separate network. This network has no direct access to the Petal service.
A few aspects of Frangipani's design can be problematic. They are the following:
- If you use a replicated Petal virtual disk, logging is done twice: once to the Frangipani log and once again within Petal itself.
- Frangipani cannot exploit disk location information in placing data, because Petal virtualizes the disks.
- Frangipani locks entire files and directories rather than individual blocks.
4. Disk Layout
As we discussed briefly, the virtual disk has 2⁶⁴ bytes of address space.
Petal commits physical disk space to virtual addresses and also de-commits when required to free the backing physical space.
Petal commits and de-commits space in pretty large chunks, currently 64 KB.
The diagram below shows the division of the virtual disk:
- The first region stores shared configuration parameters and housekeeping information.
- The second region stores the per-server logs.
- The third region holds the allocation bitmaps, which describe which blocks in the remaining regions are free.
- The fourth region holds inodes; each file needs an inode to hold its metadata.
- The fifth region holds small data blocks, each 4 KB in size.
- The remainder of the Petal address space (the sixth region) holds large data blocks.
To save space, symbolic links store their data directly in the inode.
About 2²⁴ (16 million) large files (files larger than 64 KB) can be stored.
A file can be no larger than 16 small blocks plus 1 large block (64 KB plus 1 TB).
A few configuration changes can help overcome some of these limitations; for example, using larger large blocks would raise the maximum file size. A single Frangipani server can also support multiple Frangipani file systems, each on its own 2⁶⁴-byte virtual disk.
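To visualize the layout, here is a sketch of the six regions as Go constants. The region order, the 4 KB small-block size, and the 1 TB large-block size come from the description above; the specific region boundaries are illustrative assumptions, not the paper's exact offsets.

```go
package layout

// A sketch of Frangipani's six-region division of the Petal virtual
// address space. The region order comes from the paper; the specific
// boundaries below are illustrative assumptions.
const (
	KB = uint64(1) << 10
	TB = uint64(1) << 40

	ConfigStart = 0 * TB            // shared configuration and housekeeping
	LogStart    = 1 * TB            // per-server redo logs
	BitmapStart = 2 * TB            // allocation bitmaps for the regions below
	InodeStart  = 5 * TB            // one inode per file
	SmallStart  = 6 * TB            // 4 KB small data blocks
	LargeStart  = uint64(1) << 47   // 1 TB large data blocks (assumed boundary)

	SmallBlockSize = 4 * KB
	LargeBlockSize = 1 * TB
)

// RegionOf reports which region a virtual address falls in (sketch).
func RegionOf(addr uint64) string {
	switch {
	case addr < LogStart:
		return "config"
	case addr < BitmapStart:
		return "log"
	case addr < InodeStart:
		return "bitmap"
	case addr < SmallStart:
		return "inode"
	case addr < LargeStart:
		return "small blocks"
	default:
		return "large blocks"
	}
}
```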
5. Logging and Recovery
I briefly discussed each server’s write-ahead redo logging of metadata. This simplifies failure recovery and improves performance.
To update metadata, a server first creates a record describing the update and appends it to its log in memory. These log records are periodically written to the log region of the Petal disk, in the same order as the updates were requested. Only after a log record is written does the server modify the actual metadata.
The space allocated for each log is managed as a circular buffer. When the log fills, Frangipani reclaims the oldest 25% of the log space for new log entries. If metadata blocks have not yet been written, this work is completed before the log is reclaimed.
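As a sketch of the reclaim rule (not Frangipani's actual code), the log can be modeled as a circular buffer that flushes the metadata updates covered by its oldest quarter before reusing that space:

```go
package frangipani

// logRecord describes one metadata update (a sketch).
type logRecord struct {
	seq   uint64
	apply func() // writes the described metadata update to Petal
}

// redoLog models the circular per-server log region.
type redoLog struct {
	records []logRecord // oldest first
	limit   int         // capacity of the on-disk log region
}

// append adds a record, reclaiming the oldest 25% of the log when full.
// Metadata covered by reclaimed records must reach the disk first.
func (l *redoLog) append(r logRecord) {
	if len(l.records) == l.limit {
		n := l.limit / 4
		for _, old := range l.records[:n] {
			old.apply() // flush the update before its record is reused
		}
		l.records = l.records[n:]
	}
	l.records = append(l.records, r)
}
```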
After detecting a failure, Frangipani runs recovery on that server’s log. The recovery demon is implicitly given ownership of the failed server’s log and locks.
Frangipani ensures that logging and recovery work correctly in the presence of multiple logs. To achieve this, it does two things:
- Frangipani’s locking protocol ensures that updates requested to the same data by different servers are serialized.
- Frangipani also ensures that recovery applies only to logged updates whose effects have not already reached the disk.
Frangipani ensures that only one recovery demon is trying to replay the log region of a specific server at a time. The lock service guarantees this by granting the active recovery demon an exclusive lock on the log.
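A sketch of the recovery path under these rules, with hypothetical helpers; the alreadyApplied check stands in for whatever mechanism detects superseded records:

```go
package recovery

// record is a placeholder for one logged metadata update.
type record struct{}

func (record) alreadyApplied() bool { return false } // placeholder check
func (record) apply()               {}               // placeholder replay

// replayLog sketches the recovery demon. The lock service grants the
// demon an exclusive lock on the crashed server's log region, so at
// most one demon replays a given log at a time.
func replayLog(acquireLogLock func() (release func()), log []record) {
	release := acquireLogLock() // exclusive; blocks other demons
	defer release()
	for _, r := range log {
		// Replay only updates whose effects have not already reached
		// the disk, so a stale record never clobbers newer data.
		if !r.alreadyApplied() {
			r.apply()
		}
	}
}
```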
6. Synchronization and Cache Coherence
Multiple servers access the shared on-disk data concurrently, so synchronization is required without sacrificing performance.
Frangipani uses multiple-reader/single-writer locks to implement the necessary synchronization.
When the lock service detects conflicting lock requests, the current holder of the lock is asked to release or downgrade it to remove the conflict.
A read lock allows a server to read the associated data from a disk and cache it. If a server is asked to release its read lock, it must invalidate its cache entry before complying.
A write lock allows a server to read or write the associated data and cache it. A server’s cached copy of a disk block can be different from the on-disk version only if it holds the relevant write lock.
If a server is asked to release its write lock or downgrade it to a read lock, it must write the dirty data to disk before complying. The server should retain its cache entry if it is downgrading the lock but must invalidate it if releasing the lock.
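Here is a minimal sketch of these obligations on the Frangipani server side, assuming a hypothetical cache structure and a writeToDisk stand-in for Petal writes:

```go
package coherence

// entry is a hypothetical cached copy of one lockable segment.
type entry struct {
	data  []byte
	dirty bool
}

// server holds a Frangipani server's cache; writeToDisk stands in for
// writing a segment back to the Petal virtual disk.
type server struct {
	cache       map[uint64]*entry // keyed by lock (segment) ID
	writeToDisk func(id uint64, data []byte)
}

// onRevoke handles the lock service asking this server to give up a
// write lock: downgrade=true drops to a read lock, false releases it.
func (s *server) onRevoke(id uint64, downgrade bool) {
	e := s.cache[id]
	if e == nil {
		return
	}
	// Dirty data must reach the disk before the lock is given up, so
	// the next holder sees the current contents.
	if e.dirty {
		s.writeToDisk(id, e.data)
		e.dirty = false
	}
	// A read lock still permits caching; a full release does not.
	if !downgrade {
		delete(s.cache, id)
	}
}
```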
Dividing the on-disk data structures into lockable segments is designed to keep the number of locks reasonably small.
Each log and bitmap space is a single lockable segment.
A data block or inode that is not currently allocated to a file is protected by the lock on the segment of the allocation bitmap that holds the bit marking it as free.
Each file, directory, or symbolic link is one segment for locking.
This per-file lock granularity is appropriate for engineering workloads where files rarely undergo concurrent write sharing.
7. The Lock Service
The locks in Frangipani are sticky. This means a client will generally retain a lock until another client needs a conflicting one.
The lock service deals with client failure using leases. All locks a client acquires are associated with its lease. Each lease has an expiration time (currently 30 seconds). A client must renew its lease before it expires, or the service will consider the client to have failed.
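A sketch of the lease discipline, using the 30-second expiry from the paper and an illustrative renewal period; renew and onExpire are hypothetical hooks:

```go
package lease

import "time"

const expiry = 30 * time.Second // lease lifetime, per the paper

// renewLoop keeps a client's lease alive. renew is a hypothetical RPC
// to the lock service; onExpire models the "consider it failed" path.
func renewLoop(renew func() error, onExpire func(), stop <-chan struct{}) {
	// Renew well before the deadline to leave margin for delays.
	t := time.NewTicker(expiry / 3)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if err := renew(); err != nil {
				// A client that cannot renew must assume its lease is
				// gone and discard its locks and cached data.
				onExpire()
				return
			}
		case <-stop:
			return
		}
	}
}
```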
If a server cannot renew its lease, for example because of a network failure, it discards all its locks and the data in its cache.
Frangipani's lock service implementation is fully distributed for fault tolerance and scalable performance. It consists of a set of mutually cooperating lock servers and a clerk module linked into each Frangipani server.
The lock service organizes locks into tables. 64-bit integers name individual locks within tables.
Each file system has a lock table associated with it (recall that a single server can host multiple file systems).
When a Frangipani file system is mounted, the Frangipani server calls into the clerk, which opens the lock table associated with that file system.
The lock server gives the clerk a lease identifier on a successful opening.
When the file system is unmounted, the clerk closes the lock table.
Clerks and the lock servers communicate via asynchronous messages. The basic message types that operate on locks are request, grant, revoke, and release.
The request and release message types are sent from the clerk to the lock server.
The grant and revoke message types are sent from the lock server to the clerk.
Lock upgrade and downgrade operations are also handled using these four message types.
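The protocol can be pictured with four message kinds; the encoding below is an illustrative sketch, not the paper's wire format:

```go
package lockproto

// msgKind enumerates the four basic lock-protocol message types.
type msgKind int

const (
	request msgKind = iota // clerk -> server: acquire (or upgrade) a lock
	release                // clerk -> server: give a lock back
	grant                  // server -> clerk: the lock is now held
	revoke                 // server -> clerk: please release or downgrade
)

// lockMsg is one asynchronous message between a clerk and a lock
// server. Upgrades and downgrades reuse these same four kinds.
type lockMsg struct {
	kind  msgKind
	table uint64 // lock table, one per mounted file system
	lock  uint64 // 64-bit lock name within the table
	write bool   // exclusive (writer) vs. shared (reader) mode
}
```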
The lock service uses a fault-tolerant, distributed failure detection mechanism (the same one Petal uses) to detect lock server crashes. It is based on the timely exchange of heartbeat messages between sets of servers.
A small amount of global state information that does not change often is consistently replicated across all lock servers using Lamport’s Paxos algorithm.
The global state information consists of a list of lock servers, a list of locks that each is responsible for serving, and a list of clerks that have opened but not yet closed each lock table.
For efficiency, locks are partitioned into about one hundred distinct lock groups and assigned to servers by group, not individually.
Locks are occasionally reassigned across lock servers to compensate for a crashed lock server or to take advantage of a newly recovered lock server.
When a Frangipani server crashes, the locks that it owns cannot be released until appropriate recovery actions have been performed. When a Frangipani server’s lease expires, the lock service will ask the clerk on another Frangipani server to perform recovery and then release all locks belonging to the crashed Frangipani server.
In general, the Frangipani system tolerates network partitions, continuing to operate when possible and otherwise shutting down cleanly.
There is a corner case when a Frangipani server does not really crash but is merely slow or temporarily cut off. It may still try to write to Petal after its lease has expired. Petal does not check leases when a write request arrives. So if there is a sufficient delay between Frangipani's lease check and the arrival of the subsequent write request at Petal, the lease could have expired and the write would still go through.
Frangipani uses a large enough error margin to avoid this in practice, but it is still not theoretically safe. The authors would like to eliminate this hazard.
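The client-side guard can be pictured like this (a sketch; the margin value is an assumption, not a figure from the paper):

```go
package guard

import "time"

// margin is how much lease lifetime must remain before risking a
// write; the concrete value here is an illustrative assumption.
const margin = 15 * time.Second

// mayWrite is the client-side check: send a write to Petal only if the
// lease will remain valid for at least margin. This shrinks, but does
// not close, the window, because Petal itself never checks leases.
func mayWrite(leaseExpiry time.Time) bool {
	return time.Until(leaseExpiry) > margin
}
```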
8. Adding, Removing Servers and Backup
Adding another Frangipani server is easy. The new server only needs to know which Petal virtual disk to use and where to find the lock service.
It contacts the lock service to obtain a lease, determines which portion of the log space to use from the lease identifier, and goes into operation.
Removing a Frangipani server is even easier.
It is sufficient to simply shut the server off. It is preferable for the server to flush all its dirty data and release its locks before halting, but this is not strictly needed. If the server halts abruptly, recovery will run on its log the next time one of its locks is needed, bringing the shared disk back into a consistent state.
Petal's snapshot feature makes it convenient to create complete, consistent dumps of a Frangipani file system.
Petal allows a client to create an exact copy of a virtual disk at any point in time.
The snapshot copy is identical to the original but read-only. The implementation uses copy-on-write techniques for efficiency, and the snapshots are crash-consistent.
So, we can back up a Frangipani file system simply by copying a Petal snapshot to a persistent store.
The snapshot will include all the logs, so it can be restored by copying it back to a new Petal virtual disk and running recovery on each log.
9. Related Work
The Cambridge (or Universal) File Server also takes a two-layered approach to building a file system, but the split between layers is quite different: its lower layer provides two abstractions, files and indices, which can be used to implement files and directories.
NFS is not a file system. It is a remote file access protocol. It provides a weak notion of cache coherence, and its stateless design requires clients to access servers frequently.
The Andrew File System (AFS) and DCE/DFS provide better cache performance and coherence than NFS. AFS is designed for a different kind of scalability than Frangipani, which provides a unified cluster file system. AFS has a global namespace and security architecture that allows one to plug in many separate file servers and clients over a wide area.
The Echo file system is log-based, replicates data for reliability and access paths for availability, permits volumes to span multiple disks, and provides coherent caching. However, unlike Frangipani, Echo does not provide scalability.
The Spiralog file system also offloads its file system processing to individual cluster members, which run above a shared storage system layer.
The xFS file system is closest in spirit to Frangipani. Both try to distribute the management responsibility for files over multiple machines and provide good availability and performance. However, xFS’s internal representation differs. xFS has a pre-designated manager for each file, and its storage server is log-structured.
References
- Petal: Distributed Virtual Disks (Lee and Thekkath, ASPLOS 1996)
- Scale and Performance in a Distributed File System (Howard et al., ACM TOCS 1988)