Irene Y. Zhang: Lessons learned from TAPIR(s)

Lessons learned from TAPIR(s)

Having just finished yet another iteration of the TAPIR paper, I thought I would try to distill what I have learned from the project. It has been a much harder and more interesting project than I anticipated but the photos of baby tapirs helped me get through it.

1. Replication protocols should be co-designed with the systems that use them, not applications.

Applications do not commonly use replication protocols directly today; instead, they are usually packaged into a fault-tolerant service or storage system. As Mike Burrows notes [1], there are a lot of reasons to put Paxos into a lock server and present it as a service, rather than package it as a library. With this in mind, we should be designing replication protocols for distributed systems, not applications.

Unfortunately, many high-performance replication protocols are designed for applications, so are often not super useful for distributed systems. For example, eventual consistency protocols expect the application to resolve conflicts after they happen, but a storage system using the protocol has no idea how to resolve a conflict in data from an application (i.e., Dynamo [2] cannot resolve a conflict in the Amazon shopping cart, which is why I always end up with two boxes of cereal from Amazon Fresh). Protocols for commutative operations, like Generalized Paxos [3] and EPaxos [4], have a similar problem (i.e., they do not know which application operations commute).

2. Paxos is too strict (and expensive) for a lot of systems.

Today, systems that want to provide strong guarantees use Paxos [6] (or, if you are hipper than me, RAFT [7]), and everyone else uses something cheaper. Paxos enforces a strict serial ordering of operations across replicas, which is useful, but requires coordination across replicas on every operation, which is expensive.

What we found in the TAPIR project is that Paxos is too strong for some strong system guarantees and, as a result, is wasting work and performance for those systems. For example, a lock server wants mutual exclusion, but Paxos provides a strict serial ordering of lock operations. This means that a lock server built using Paxos for replication is coordinating across replicas even when it is not necessary to ensure mutual exclusion.

Even more interesting, a transactional storage system wants strictly serializable transactions, which requires a linearizable ordering of transactions but only requires a partial ordering of operations (because not all transactions touch all keys). With some careful design in TAPIR, we are able to enforce a linearizable ordering of transactions with no ordering of operations.

3. We can still provide the same strong guarantees to applications without consistency!

At the core of the TAPIR project is a new replication protocol, which we call inconsistent replication or IR, that provides fault-tolerance but no consistency. IR is well-designed to provide high-performance replication for a class of system guarantees, including mutual exclusion and ACID with linearizable transaction ordering. Instead of an ordered operation log, IR provides an unordered operation set. Without ordering, IR is able process operations without a leader or any synchronous cross-replica coordination, similar to eventual consistency protocols.

But, unlike eventual consistency, IR allows distributed systems to avoid conflicts before they happen, rather than resolving them after they happen. This allows systems to retain strong guarantees for their applications and hide any inconsistencies from application programmers. IR’s limitation is that it can only support system guarantees that depend on conflicts between pairs of operations. For example, IR can be used to replicate a lock server but not a sales app that only sells thing in stock. The lock server’s mutual exclusion guarantee is a pair-wise invariant, but calculating the store’s stock requires a full history of operations that only strong consistency can achieve.

4. Good clock synchronization is available and we should use it.

Without ordering in the replication layer, TAPIR needs another way to efficiently order transactions, so it uses loosely synchronized clocks for better performance. Throughout the project, we have continuously been surprised by how good clock synchronization is on various platforms. In fact, we designed several extensions to TAPIR to cope with clock skew but never had to pull them out.

When we first started testing on our local cluster using PTP [8], which is available as a Linux package and does not require any special hardware, the clock synchronization was so good that it broke our random number generator because we were seeding it with the clock! As others [9] have noticed, PTP is able to offer at least sub-microsecond synchronization without any special support. When we moved to wide-area experiments using Google Compute Engine, we found clocks skews within milliseconds, even though we were using virtual machines over the wide-area.

Summary

We should design better replication protocols optimized for distributed systems, not applications. For a class of strong system guarantees, Paxos is too strong and eventual consistency is too weak. Instead, we have designed inconsistent replication, a new replication protocol that provides fault-tolerance without consistency for distributed systems. Using IR and loosely synchronized clocks, TAPIR is able to provide high-performance distributed transactions.

We have experimentally verified that baby tapirs are adorable. By looking at countless photos of baby tapirs in real-world conditions, we have measured 100% of baby tapirs to be extremely cute.

References

[1] M. Burrows. The Chubby Lock Service for Loosely-coupled Distributed Systems. In Proc. of OSDI, 2006.

[2] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly Available Key-value Store. In Proc. of SOSP, 2007.

[3] L. Lamport. Generalized Consensus and Paxos. Technical Report 33, Microsoft Research, 2005.

[4] I. Moraru, D. G. Andersen, and M. Kaminsky. There is more Consensus in Egalitarian Parliaments. In Proc. of SOSP, 2013.

[5] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for Internet-scale Systems. In Proc. of USENIX ATC, 2010.

[6] L. Lamport. Paxos Made Simple. ACM Sigact News, 2001.

[7] D. Ongaro and J. Ousterhout. In Search of an Understandable Consensus Algorithm. In Proc. of USENIX ATC. 2014.

[8] IEEE 1588 Standard for A Precision Clock Synchronization Protocol for Networked Measurement and Control Systems.

[9] J. Perry, et al. Fastpass: a centralized zero-queue datacenter network. In Proc. of SIGCOMM. 2014.

Contact

irene.zhang @microsoft.com
Building 99
14820 NE 36th St.
Redmond, WA 98052
iyzhang@github
irene@discuss.systems

News

25 August 2025
Updates on workshop publications this year! Congratulations to Katie Lim on her [HotOS paper on operating systems for FPGA accelerated NICs](papers/apiary-hotos25.pdf) and Inho Choi on his [APSys paper on ML-tuned datapath operating systems architecture](papers/autokernel-apsys25.pdf).
04 March 2025
New Blog Post: University of Washington Sexual Harassment and Bullying Investigation
02 November 2024
Our work, led by [Katie Lim](https://homes.cs.washington.edu/~katielim/), on a [modular architecture for NIC-attached accelerators](https://arxiv.org/pdf/2403.14770) was presented in [MICRO 2024](https://microarch.org/micro57/)!

Older posts…