Corfu provides the abstraction of an infinitely growing log to applications.
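The operations a client performs against this abstraction (append, read, fill, trim) recur throughout this section. The interface below is a hedged sketch of that surface, with Go names chosen purely for illustration; it is not the actual Corfu client library.

```go
package corfu

// LogPosition identifies one entry in the shared log's address space.
type LogPosition = uint64

// SharedLog captures the client-facing operations discussed in this section.
type SharedLog interface {
	// Append writes val at the next available log position and returns it.
	Append(val []byte) (LogPosition, error)
	// Read returns the value at pos; unwritten or trimmed positions fail.
	Read(pos LogPosition) ([]byte, error)
	// Fill junk-fills a hole at pos so that readers are not blocked by it.
	Fill(pos LogPosition) error
	// Trim tells Corfu that pos is no longer needed and may be reclaimed.
	Trim(pos LogPosition) error
}
```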


If such a value exists, the client walks down the chain to find the first unwritten replica, and then 'completes' the append by copying the value to the remaining unwritten replicas in chain order. Alternatively, if the first unit of the chain is unwritten, the client writes the junk value to all the replicas in chain order. Corfu exposes this fast hole-filling functionality to applications via a fill interface. Applications can use this primitive as aggressively as required, depending on their sensitivity to holes in the shared log.

Second, storage units may fail. We address configuration changes in the next section.

Third, due to reconfiguration, a client might fail to read a position even though a write to it has completed. This may happen as follows. Say that a projection is changed to remove a working unit, e.g., due to a network disruption that causes the unit to temporarily detach. Some client may remain uninformed of the reconfiguration, perhaps having been detached from the main partition of the system by the same network disruption. If this unit stored the last page in a replica set for some position in the previous projection, the client might continue to probe this unit for the status of that log position. The client may then obtain an err_unwritten error code even after the position has been completely written in the new configuration. There are several ways to address this problem. Even with this somewhat rare failure scenario, the system upholds serializability of reads, which may suffice. Alternatively, we may disable fast reads altogether and require reads to go to all replicas, unless the client already knows the entry has been filled.
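Returning to hole filling: the following is a minimal client-side sketch of the chain walk described above, assuming a hypothetical StorageUnit interface, error value, and junk pattern; it is illustrative only, not the actual Corfu client code.

```go
package corfu

import "errors"

// ErrUnwritten is the assumed error a storage unit returns for an empty page.
var ErrUnwritten = errors.New("err_unwritten")

// JunkValue is the pattern used to deliberately fill a skipped log position.
var JunkValue = []byte{0xDE, 0xAD}

// StorageUnit is the hypothetical per-replica interface assumed here.
type StorageUnit interface {
	Read(pos uint64) ([]byte, error)    // ErrUnwritten if the page is empty
	Write(pos uint64, val []byte) error // write-once semantics
}

// Fill completes or junk-fills log position pos across a replica chain.
func Fill(chain []StorageUnit, pos uint64) error {
	// Probe the head of the chain: if it holds a value, the append was
	// started and must be completed; otherwise the hole is junk-filled.
	val, err := chain[0].Read(pos)
	if errors.Is(err, ErrUnwritten) {
		val = JunkValue
	} else if err != nil {
		return err
	}
	// Walk down the chain and copy the value to every unwritten replica,
	// in chain order, so that readers observe a consistent result.
	for _, unit := range chain {
		if _, rerr := unit.Read(pos); errors.Is(rerr, ErrUnwritten) {
			if werr := unit.Write(pos, val); werr != nil {
				return werr
			}
		}
	}
	return nil
}
```

Under write-once page semantics, concurrent fillers of the same position converge on the same value, which is why the primitive can be used as aggressively as the application requires.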

Another solution is to assign leases to storage units, which automatically expire unless the unit periodically renews them. Some entity in the system, e.g., the sequencer, must manage these leases; this is not a heavy burden, as renewals are performed at fairly coarse intervals. However, a cost of this approach is that we must wait for a lease to expire before we can remove a unit from the configuration.

When some event occurs that necessitates a change in a projection – for example, when a storage unit fails, or when the tail of the log moves past the current active range – we switch to a new projection. The succession of projections forms a sequence of epochs, and clients can read and write the storage units directly so long as their projection is up to date. When we change the projection, we invoke a seal request on all relevant storage units, so that clients holding obsolete copies of a projection are prevented from continuing to access them. All messages from clients to storage units are tagged with the epoch number, so messages from sealed epochs can be aborted. In this sense, a projection serves as a view of the current configuration. However, a projection change does not necessarily discard an old configuration and replace it with a new one. Rather, it may link a new range into a succession of ranges, keeping the old ones intact and letting clients continue reading log entries that have already been filled. Likewise, it may affect the configuration of some past ranges and not others. So over time, the log evolves in disjoint ranges, each one using its own projection onto a set of storage extents. Care must be taken to ensure that clients operating in the context of previous epochs and clients in the new epoch read the log consistently. Projections – and the ability to move consistently between them – offer a versatile mechanism for Corfu to deal with dynamism. Figure 3.8 shows an example sequence of projections. Eventually, the log tail moves past 80K, and the system again reconfigures, adding a new range to the projection to service reads and writes past 80K.
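A minimal sketch of how epoch tagging and sealing might look on a storage unit follows, using assumed names (SealableUnit, Seal, checkEpoch); the actual units implement this logic differently, so treat it as an outline of the mechanism rather than the real interface.

```go
package corfu

import (
	"errors"
	"sync"
)

// ErrSealedEpoch is the assumed rejection a stale client receives.
var ErrSealedEpoch = errors.New("err_sealed: client projection is obsolete")

// SealableUnit is an illustrative storage unit that enforces epoch tags.
type SealableUnit struct {
	mu           sync.Mutex
	currentEpoch uint64 // highest epoch this unit has been sealed into
}

// Seal is invoked during a projection change; afterwards, any message tagged
// with an older epoch is rejected.
func (u *SealableUnit) Seal(epoch uint64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if epoch > u.currentEpoch {
		u.currentEpoch = epoch
	}
}

// checkEpoch guards every client request; messages from sealed epochs abort.
func (u *SealableUnit) checkEpoch(epoch uint64) error {
	u.mu.Lock()
	defer u.mu.Unlock()
	if epoch < u.currentEpoch {
		return ErrSealedEpoch
	}
	return nil
}

// Write shows how a client message carries its epoch tag; reads are guarded
// in the same way.
func (u *SealableUnit) Write(epoch, pos uint64, val []byte) error {
	if err := u.checkEpoch(epoch); err != nil {
		return err // the client learns that its projection is stale
	}
	// ... write-once store of val at unit-local position pos elided ...
	return nil
}
```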

The mechanics of forming a reconfiguration decision are tailored to the client-centric Corfu architecture and, in particular, involve no communication among the storage units. More specifically, our reconfiguration procedure involves two steps. The first is a 'seal-and-snapshot' step, whose purpose is to ensure that any appended value in the current projection Pi survives reconfiguration. This would not be possible if clients continued writing directly to the active range of Pi indefinitely. We reject any client messages with obsolete epochs, writes as well as reads. When clients receive these rejections, they realize that the current projection has been sealed. We note that storage units not affected by the reconfiguration need not be sealed; hence, often only a small subset of storage units has to be sealed, depending on the reason for reconfiguration.

The second step involves reaching an agreement decision on the next projection Pi+1. The decision must preserve the safety of any log entries which have been fully written, and might have been read by clients, in the current active range. Therefore, we always 'err on the conservative side' and include any entry which is filled in all surviving storage units. For example, consider the scenario in Figure 3.8. Moving from one projection to the next, we lost F6 but retained the mirror F7. The new projection upholds the safety of any log entries which have been fully mirrored to F7. In a slightly different scenario, we might lose F7 and retain F6. In this case, we would not know whether entries [40K,50K) have been successfully mirrored, but we would have to assume that they might have been, and extend the projection accordingly. We do not discuss the details of the actual consensus protocol here, as there is abundant literature on the topic. Our current implementation incorporates a Paxos-like consensus protocol using storage units in place of Paxos acceptors. Note that multiple clients can initiate reconfiguration simultaneously, but only one of them succeeds in proposing the new projection. Clients that read a storage unit and discover that their local projection is stale need to learn the new projection. We currently employ a shared network drive for storing the sequence of agreed-upon projections, but other methods may replicate this information more robustly.
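The flow of the two steps can be outlined as follows. The names here (Projection, Snapshot, sealAffectedUnits, proposeViaConsensus) are hypothetical placeholders standing in for the real machinery; this is a sketch under assumptions, not the actual implementation.

```go
package corfu

// Projection describes one epoch's mapping of log ranges onto storage units;
// the range-to-extent details are elided here.
type Projection struct {
	Epoch           uint64
	PreservedPrefix uint64 // entries below this offset are known to be safe
}

// Snapshot summarizes what the sealed units report about written entries.
type Snapshot struct {
	MaxFullyWritten uint64
}

// sealAffectedUnits sends seal requests, tagged with the new epoch, to the
// storage units touched by the change and gathers their write state.
func sealAffectedUnits(cur, next Projection) (Snapshot, error) {
	// Placeholder: a real implementation contacts each affected unit and
	// records which positions are filled on all surviving replicas.
	return Snapshot{}, nil
}

// proposeViaConsensus runs a Paxos-like round with storage units acting as
// acceptors; when several clients race, exactly one proposal is chosen.
func proposeViaConsensus(epoch uint64, p Projection) (Projection, error) {
	p.Epoch = epoch
	return p, nil
}

// Reconfigure performs the two steps: seal-and-snapshot, then agreement.
func Reconfigure(cur, next Projection) (Projection, error) {
	snap, err := sealAffectedUnits(cur, next)
	if err != nil {
		return Projection{}, err
	}
	// Err on the conservative side: preserve any entry that might have been
	// fully written, and possibly read, under the current projection.
	next.PreservedPrefix = snap.MaxFullyWritten
	return proposeViaConsensus(cur.Epoch+1, next)
}
```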

The application does not have to move data around in the address space of the log to free up space. All it is required to do is use the trim interface to inform Corfu when individual log positions are no longer in use. As a result, Corfu makes it easy for developers to build applications over the shared log without worrying about garbage collection strategies. An implication of this approach is that as the application appends to the log and trims positions selectively, the address space of the log can become increasingly sparse. Accordingly, Corfu has to efficiently support a sparse address space for the shared log. The solution is a two-level mapping. As described before, Corfu uses projections to map from a single infinite address space to individual extents on each storage unit. Each storage unit then maps a sparse 64-bit address space, broken into extents, onto its physical set of pages. The storage unit has to maintain a hash map from 64-bit addresses to the physical address space of the storage.

Another place where system information might grow large is the succession of projections. Each projection by itself is a range-to-range mapping, and hence it is quite concise. However, it is possible for an adversarial workload to result in bloated projections; for instance, if each range in the projection has a single valid entry that is never trimmed, the mapping for that range has to be retained in the projection in perpetuity. Even for such adversarial workloads, it is easy to bound the size of the projection tightly by introducing a small amount of proactive data movement across storage units in order to merge consecutive ranges in the projection. For instance, we estimate that adding 0.1% writes to the system can keep the projection under 25 MB on a 1 TB cluster for an adversarial workload. In practice, we do not expect projections to exceed tens of kilobytes for conventional workloads; this is borne out by our experience building applications over Corfu. In any case, handling multi-MB projections is not difficult, since they are static data structures that can be indexed efficiently. Additionally, new projections can be written as deltas to the auxiliary share.
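As a hedged illustration of the first level of this two-level mapping, the sketch below shows a projection that maps disjoint ranges of the infinite log address space onto storage-unit extents. The type names (Extent, Range, Resolve) are assumptions for illustration, not the actual Corfu data structures; the second level, from unit-local addresses to flash pages, lives on the storage unit itself.

```go
package corfu

import "fmt"

// Extent identifies a contiguous region on one replica set of storage units.
type Extent struct {
	Units  []string // replica chain, e.g., network addresses of the units
	Offset uint64   // starting unit-local address of this extent
}

// Range maps a half-open interval [Start, End) of log positions to an extent.
type Range struct {
	Start, End uint64
	Target     Extent
}

// Projection is the ordered list of disjoint ranges for one epoch.
type Projection struct {
	Epoch  uint64
	Ranges []Range
}

// Resolve maps a log position to a replica chain and a unit-local address.
func (p *Projection) Resolve(pos uint64) (Extent, uint64, error) {
	for _, r := range p.Ranges {
		if pos >= r.Start && pos < r.End {
			return r.Target, r.Target.Offset + (pos - r.Start), nil
		}
	}
	return Extent{}, 0, fmt.Errorf("position %d not covered by epoch %d", pos, p.Epoch)
}
```

Because trimmed ranges simply drop out of the projection, the same structure also expresses the sparseness that selective trimming produces.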

In order to support an infinite address space, the storage device must provide a persistent mapping from a 64-bit virtual address onto a physical address. SSDs often use such a structure, although the map's domain is usually limited to the nominal disk size of the SSD, and the granularity of the mapping function is often coarser than a single page. There is considerable overlap between what is described here and the functionality of the Flash Translation Layer firmware found in an SSD. Thus, there is good reason to think about merging these components. We discuss this possibility in Section 3.4.6.

In practice, an SVA need only be large enough to support the maximal number of writes for a given device. Since NAND flash supports a limited number of erase cycles, we can base our data structures on the notion that the size of an SVA is roughly bounded by the number of flash pages times the maximal erase cycle count. Our current implementation uses a traditional hash table to implement a map that resolves to flash pages of size 4 KB. This data structure occupies 4 MB of memory per GB of target flash. We have designed, but not implemented, a significantly more compact structure using Cuckoo Hashing, as described in Section 3.4.4.

Each entry of the page map contains per-page state, as well as an SPA if the page is in the written state or awaiting reclamation. We keep three pointers with regard to the overall SVA space on each SLICE: a head pointer denoting the maximum written entry; a minimum unwritten pointer; and a pointer below which all trimmed pages have been reclaimed. An additional pointer indicating the minimum written position can also be used to restrict the set of logical addresses under consideration during a prefix trim. These pointers need not be maintained persistently, since they can be recovered from the mapping table. All trimmed positions that both lie below the minimum unwritten pointer and have been reclaimed can be eliminated from the map.

We optimize the hole-filling operation by using a special value of the flash page pointer to denote the junk pattern that fills a hole in the log. Thus, hole-filling can be accomplished by manipulating the mapping table alone: set the physical page pointer to the junk value and mark the page as trimmed. In addition to the mapping structure, the SLICE implementation must track the set of sealed epochs and maintain a free list of flash pages for new writes. The former must be stored persistently, but the latter can be reconstructed from the mapping table. For best performance, the ordering of the free list should take into account specific peculiarities of the media, such as locality or the need to perform sequential writes within flash blocks.

Should it become necessary to efficiently enumerate very sparse logs, we could introduce a data structure to track ranges of reclaimed addresses and an API method to access it. However, the applications we have built so far only walk through the compact portion of logs, so we have not yet found the need to take such measures. Some of the newer API functions that are part of the software implementation of the Corfu server have not yet been fully implemented in hardware. For example, the minimum unwritten pointer was not part of our original hardware design. These features can be integrated straightforwardly into the current design. Our prototype hardware design is presented in Figure 3.9.
Each SLICE comprises an FPGA with a gigabit Ethernet link, a SATA-based SSD, and 2 GB of DDR2 memory. The design is flexible and scalable: hundreds of SLICEs may be used to support a single shared log given sufficient network capacity. We have engineered our SLICE unit to be inexpensive and low-power while delivering sufficient performance to saturate its Ethernet link.
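To make the mapping-table mechanics described above concrete, here is a rough software sketch under assumed names (PageMap, Write, Fill, Read): a hash table from sparse 64-bit virtual addresses (SVAs) to physical flash page addresses (SPAs), with hole filling done entirely in the map via a reserved junk pointer. The free list and sealed-epoch set are omitted, and the actual SLICE implements this logic in the FPGA, so this is illustrative only.

```go
package slicemap

import "errors"

// junkSPA is a reserved physical pointer standing for the junk pattern.
const junkSPA = ^uint64(0)

var (
	ErrUnwritten = errors.New("err_unwritten")
	ErrTrimmed   = errors.New("err_trimmed")
	ErrWritten   = errors.New("err_written") // write-once violation
)

type entry struct {
	spa     uint64
	trimmed bool
}

// PageMap maps sparse SVAs to per-page state and SPAs.
type PageMap struct {
	m            map[uint64]entry
	head         uint64 // maximum written SVA
	minUnwritten uint64 // lowest SVA that has never been written
	// The flash-page free list and the set of sealed epochs are omitted;
	// the former is rebuilt from this map on recovery, the latter persists.
}

func New() *PageMap { return &PageMap{m: make(map[uint64]entry)} }

// advance keeps the volatile pointers up to date; both can be recovered
// from the mapping table after a restart.
func (p *PageMap) advance(sva uint64) {
	if sva > p.head {
		p.head = sva
	}
	for {
		if _, ok := p.m[p.minUnwritten]; !ok {
			return
		}
		p.minUnwritten++
	}
}

// Write installs a flash page for a never-written SVA (write-once).
func (p *PageMap) Write(sva, spa uint64) error {
	if _, ok := p.m[sva]; ok {
		return ErrWritten
	}
	p.m[sva] = entry{spa: spa}
	p.advance(sva)
	return nil
}

// Fill junk-fills a hole by manipulating only the map: point the entry at
// the reserved junk value and mark it trimmed, consuming no flash page.
func (p *PageMap) Fill(sva uint64) error {
	if _, ok := p.m[sva]; ok {
		return ErrWritten
	}
	p.m[sva] = entry{spa: junkSPA, trimmed: true}
	p.advance(sva)
	return nil
}

// Read resolves an SVA; a junkSPA result tells the caller to serve the junk
// pattern rather than touching flash.
func (p *PageMap) Read(sva uint64) (uint64, error) {
	e, ok := p.m[sva]
	switch {
	case !ok:
		return 0, ErrUnwritten
	case e.trimmed && e.spa != junkSPA:
		return 0, ErrTrimmed
	}
	return e.spa, nil
}
```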