Silver keeps track of all changes in the log, enabling users to go back to any point in time almost for free. It is built to be truly append-only and never overwrites data. While other append-only file systems exist, especially in the realm of optical media, Silver combines Corfu with Replex, a unique replication protocol which enables efficient log virtualization to support multiple writers, low-latency linearizable reads, and the ability to create fast copy-on-write clones.

Silver is built on top of a distributed, shared log. Our log is made of two components: a high-throughput sequencer which orders append operations and a write-once storage device which stores updates. This design has been shown to scale to more than half a million operations per second. We have previously described a hardware implementation of this design on an FPGA. As described in Tango, in addition to the basic log append and random read operations, our log also supports streams, which are essentially virtualized logs within our log, identified by a unique 128-bit id. In Silver, we implement log virtualization using Replex, which enables fast, random access to streams by building a secondary index during replication. This avoids the overhead and complexity of traversing back pointers in Tango. We describe the basic operations supported by our log in table 6.2. Silver is entry-oriented, not block-oriented, and clients may only append and read entries. Entries in our log are variably sized and we do not impose a size limit. Even though users may delete directories or overwrite data, the log preserves the order in which changes are applied to the file system, so that a delete operation can easily be “undone”.
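To make the log interface concrete, the sketch below shows one way the basic operations referenced in table 6.2 might be expressed in Java; the interface name, method signatures, and types are illustrative assumptions rather than Silver's actual API.

```java
import java.util.List;
import java.util.UUID;

// Illustrative sketch of the basic shared-log operations (cf. table 6.2).
// Names and signatures are assumptions, not Silver's actual API.
interface SharedLog {
    // Append an entry to one or more streams; the sequencer assigns the next
    // global log address, which also serves as a file-system version number.
    long append(byte[] entry, UUID... streamIds);

    // Read the entry stored at a log address (entries are write-once and immutable).
    byte[] read(long address);

    // Ask the sequencer for the last address issued to a stream, without reading
    // any data; used by clients to detect whether their cached state is stale.
    long check(UUID streamId);

    // Read, in order, the entries appended to a stream in the address range
    // (fromExclusive, toInclusive]; used for initial replay and to fetch only
    // the updates a client has not yet applied.
    List<byte[]> readStream(UUID streamId, long fromExclusive, long toInclusive);
}
```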
Every address in the log is a version in the file system, and “time-travel” can be done by restricting traversal to a specific set of addresses. Every file system starts with a root directory. Clients use the stream ID of the root directory to find the file system. To open a file system, an uninitialized client first calls the check function to get the last update on the root directory from the sequencer. The client then must read the entire root directory stream and apply those updates in memory to get the current state of the root directory. Once the root directory is read, subsequent traversals of the file system selectively read only the streams necessary to satisfy each request. For example, to perform an ls of a child directory, the client would only need to read the stream for that directory. All requests for metadata are served quickly and efficiently from in-memory state, and updating that state consistently involves merely contacting the sequencer, reading any updates on the stream, and then applying them to the in-memory state. This permits fast data structures such as hash tables and skip lists to be used rather than the traditional B-trees for the file map. However, keeping the data for large files in memory would be costly and impractical. To support large files efficiently, we implement extent entries, which contain the entire file within a single entry, and we support partial reads of that entry. The single entry is written sequentially into the log and does not require multiple pointers or a tree structure to traverse, unlike traditional file systems, which write many extents that must be traversed through a tree due to allocation and reallocation of extent space. This allows fast random accesses to large chunks of data stored efficiently within the log.

An important optimization is that pointers in Silver always refer to other streams by their id, rather than a physical address as in other file systems. For example, in figure 6.5, directory a points to metadata stream b, rather than physical address 2. This allows the pointer to b to remain valid even after b is updated, and clients can quickly get the latest version by contacting the appropriate stream replica.
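As a rough sketch of the client-side procedure just described (contact the sequencer, read the unseen portion of a directory's stream, and apply those deltas to in-memory state), the code below works against the hypothetical log interface sketched earlier; the class, field, and method names are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of how a client might materialize a directory from its
// stream. Child pointers hold stream IDs, not physical log addresses, so they
// stay valid as children are updated. Names and the delta format are assumptions.
final class DirectoryView {
    // child name -> stream ID of the child's metadata stream
    private final Map<String, UUID> children = new HashMap<>();
    private long appliedUpTo = -1;  // last log address applied to this view

    // Bring the in-memory view up to date: ask the sequencer for the stream's
    // tail, then read and apply only the updates this view has not yet seen.
    void refresh(SharedLog log, UUID dirStreamId) {
        long tail = log.check(dirStreamId);
        if (tail <= appliedUpTo) {
            return;  // already current; nothing to read
        }
        for (byte[] delta : log.readStream(dirStreamId, appliedUpTo, tail)) {
            apply(delta);  // e.g. add, rename, or remove a child entry
        }
        appliedUpTo = tail;
    }

    private void apply(byte[] delta) {
        // Decode the delta and mutate `children`; the encoding is left abstract here.
    }
}
```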
In this manner, we eliminate the dependency on the equivalent of the inode map in LFS. Other file systems, like btrfs, suffer from the recursive update problem: because the pointer to physical address 2 is no longer valid after an update, a new root must be written which points to the new update, and the change must propagate all the way to the root, leading to significant write amplification.

File systems today employ many reader-writer locks for mutual exclusion where read and write paths intersect, since writers may overwrite previously written data. In Silver, the read path and write path are separated since no overwriting occurs: any successful write is immutable, which obviates the need for mutual exclusion logic. Mutual exclusion is an even bigger problem in distributed systems, where many designs have resorted to relaxed consistency for performance; because Silver never overwrites, it never compromises consistency. This separation greatly simplifies caching in Silver. Silver currently implements an efficient LRU cache of log entries in DRAM for faster log traversal and to cache large files, while in-memory state serves as a cache for both metadata and small files. In a traditional distributed file system, stale data must be detected and evicted, and often the entire file must be copied. In Silver, updating stale state simply involves contacting the sequencer, which informs the client of staleness immediately, reading any updates, which are stored as deltas in the log, and applying them to the in-memory state.

Isolating the write path also simplifies tiering, and allows Silver to easily take advantage of heterogeneous storage systems. For example, consider a storage system which consists of fast NVM, SSDs, and hard disks. A tiered Silver system could direct writes first to fast NVM. As the NVM fills up sequentially, writes can be evicted to the SSD in large sequential chunks, permitting the NVM to be reused. Before eviction, a small update to the read cache’s map would allow reads to continue being serviced, and the read cache would ensure that popular old entries still have fast service times, despite being evicted to archival storage. Even though Silver is designed around an ever-growing log, we recognize that practical storage considerations may limit the number of updates which a system can store. To mitigate this limitation, we offer two mechanisms: first, a compress command which compresses prefixes of the log and appends the compressed prefixes to the end of the log; second, a checkpoint mechanism which takes an address and compacts the history of each stream into a single entry to free space. However, once a checkpoint is performed, “time-travel” to history prior to the checkpoint is no longer permitted since that history is lost. Users may choose to checkpoint only the histories they are interested in.
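The DRAM entry cache mentioned above is easy to sketch precisely because log entries are immutable: cached copies never need invalidation, only eviction. A minimal illustration using Java's access-ordered LinkedHashMap follows; the class name and capacity policy are assumptions, not the prototype's actual cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of an LRU cache of immutable log entries, keyed by log address.
// Because entries are write-once, a cached copy can never become inconsistent;
// the only policy decision is when to evict. Sizing by entry count is a
// simplifying assumption (a real cache would more likely track bytes).
final class LogEntryCache extends LinkedHashMap<Long, byte[]> {
    private final int maxEntries;

    LogEntryCache(int maxEntries) {
        super(16, 0.75f, /* accessOrder = */ true);  // order entries by recency of use
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxEntries;  // evict the least recently used entry
    }
}
```

A read of address a would first consult cache.get(a) and fall back to the log on a miss, installing the result with cache.put(a, entry).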
As compression and checkpointing always free data from the start of the log and append to the end of the log, a fixed amount of storage can be used as a circular buffer.

Since every update to Silver is logged, Silver supports full time travel by playing back the log. Snapshots are truly free in Silver, unlike other log-based systems where snapshots must be explicitly created to freeze data blocks. In addition, Silver is always consistent because the log is an authoritative source for the ordering of writes. Shipping a snapshot of Silver is as simple as transferring a prefix of the log. Currently a user of Silver can read the file system version by reading a special entry in the file system; reading this entry queries the sequencer. To access the snapshot, a clone command is provided which takes the version number and directory as parameters and creates a new, fully writeable clone of the directory using CoW, which is described in the next section.

Copy-on-write (CoW) is a feature supported by many logging file systems. It enables a file system to quickly create a diverging copy by referencing the source and only recording changes. Silver supports CoW of any stream. The copy command creates a CoW entry containing the source stream and an address, which denotes the version where the new stream diverges from the source stream (a sketch of this operation appears below). When a directory entry is copied, subsequent traversals to children of that directory create copies as well to ensure that the new history diverges from the source. Other CoW file systems suffer from fragmentation under random writes because they must allocate data blocks or extents for a CoW file and update pointers for every write. This problem is so severe that most logging file systems recommend disabling CoW support, especially for large files. Silver does not suffer from this problem: random writes are sequentialized by virtue of the log, and the fast streaming support eliminates the problems fragmentation may pose.

We have prototyped Silver in Java 8 over FUSE, through the use of Java Native Runtime bindings. While the use of Java prevents our current prototype from performing as well as a native implementation, our prototype enables us to understand and evaluate the key benefits of Silver. Our implementation is scalable and distributed: following the design of Corfu [13], our log can be sharded and replicated across many nodes. However, while we aim to be as POSIX-compliant as possible, due to limitations in FUSE, Silver is not fully POSIX-compliant. Furthermore, the use of FUSE adds significant overhead. Our current implementation of Silver is elegant and concise, taking a mere 3,186 SLOC in Java. Part of the simplicity of Silver owes to the fact that much of the streaming logic is implemented in the distributed log, which is about 15k SLOC but reusable by other applications. We also envision that the distributed log may be implemented in hardware.
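To make the copy command described above concrete, the snippet below shows one way a clone might be created by appending a CoW entry to a fresh stream, using the hypothetical log interface from the earlier sketch; the entry encoding and the class and method names are illustrative assumptions, not the prototype's actual code.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative sketch of creating a CoW clone: the clone's first entry records
// the source stream and the address (version) at which the histories diverge.
// The binary layout and names here are assumptions.
final class CowClone {
    // Create a writeable clone of `sourceStream` as of `version`.
    static UUID clone(SharedLog log, UUID sourceStream, long version) {
        UUID newStream = UUID.randomUUID();
        // Encode the CoW entry: (source stream id, divergence address).
        ByteBuffer buf = ByteBuffer.allocate(24);
        buf.putLong(sourceStream.getMostSignificantBits());
        buf.putLong(sourceStream.getLeastSignificantBits());
        buf.putLong(version);
        log.append(buf.array(), newStream);  // first entry of the new stream
        return newStream;
    }
}
```

Reads of the clone below the divergence address would fall through to the source stream, while new writes append only to the clone's stream.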
While Silver is distributed, single-node log-structured file systems still serve as a useful comparison point for Silver’s design. Historically, these stores were introduced mostly in order to serialize I/O, not to expose the history of versions. Systems like LFS and Zebra aggressively garbage-collected all but the latest updates to any block. Consequently, the metadata issues they tackle are quite different: the name space did not need to expose versions, nor support cloning and snapshots. btrfs, nilfs, WAFL and ZFS are copy-on-write file systems with snapshot capability. These systems primarily log metadata, but data is typically stored in allocated regions, called extents in btrfs and slabs in ZFS. Generating snapshots, as a result, is not automatic, as the file system must lock data regions to create a snapshot. Furthermore, the allocated regions are at risk of inconsistency, whereas in Silver the log is the pristine source of ordering for all metadata and file data. Finally, allocators can suffer from fragmentation, especially with a high number of random writes. Silver converts random writes into compact sequential writes, while an efficient cache mechanism enables these writes to be read efficiently.

UDF is an example of a system designed for optical media, and Quinlan describes a file system for early WORM media. Unlike Silver, these file systems are mainly designed for archival content, which is reflected by their slow access and mount times.

Most distributed file systems, such as the popular HDFS, Ceph and CalvinFS, separate the storage of metadata and data. This separation makes it difficult to create true snapshots: for example, in HDFS and Ceph snapshots only capture the state of the metadata. Data, which is stored separately from metadata, can continue to change after a snapshot, resulting in an inconsistent snapshot. Furthermore, these file systems are not copy-on-write, so creating a modifiable clone is an expensive operation.

Tango presented the concept of a stream within a distributed log. In Tango, streams never referred to each other: rather, they contained opaque objects on a single log so that transactions could be run against them. In Silver, our file system leverages many streams in order to support efficient accesses and copies. In Tango, back pointers are used to implement streams, which imposes a burden both on the sequencer, which must persist the last token issued for each stream, and on failure recovery, which requires scanning the log to find the stream. Furthermore, bulk reads of a stream are not possible with back pointers.