Amazon Dynamo paper notes

Notes based on the Amazon Dynamo paper. Note that Dynamo and AWS DynamoDB are different systems (at least judging by the name); that difference is not discussed in this document.

Dynamo is designed to be an eventually consistent storage. At the same time, it allows clients to specify the number of hosts in the read (R) and write (W) quorums. The number of replicas N (the number of hosts each key is stored on) is also configurable. A good rule of thumb is to keep N >= 3 and R + W > N; that way every read quorum overlaps with the latest successful write quorum, so reads normally see the most recent data.
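
A minimal sketch of that overlap property, assuming simple strict quorums (Dynamo actually uses sloppy quorums, so this is only the idealized case):

```python
# With N replicas, a write acknowledged by W of them and a read that queries R of
# them must share at least one replica whenever R + W > N, so the read sees the
# latest acknowledged write.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True if any read quorum is guaranteed to intersect any write quorum."""
    return r + w > n

# Typical configuration mentioned in the paper:
assert quorums_overlap(n=3, r=2, w=2)      # overlap guaranteed
assert not quorums_overlap(n=3, r=1, w=1)  # a read may miss the latest write
```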

Many systems use synchronous data replication to keep consistency strong (e.g. MySQL). This hurts availability, because if a replica fails, the data it holds becomes unavailable to its clients. Also, only a single master host can accept writes in such systems. In contrast, Dynamo uses asynchronous data replication. This helps to build partition-tolerant systems: Dynamo accepts reads and writes even during a network partition and resolves update conflicts using different reconciliation algorithms.

Dynamo is a P2P system that uses a gossip-based protocol to share membership and failure-detection information with peers. The membership information contains the list of key ranges the node is responsible for. Nodes also use the gossip protocol to announce themselves to other hosts: when a new Dynamo host starts, it already knows a list of seed hosts and connects to them to announce its availability. The seed hosts then spread this information to other peers via gossiping.
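
A toy sketch of how gossip-based membership exchange could look; the `Node`, `view` and `gossip_once` names are made up for illustration and are not Dynamo's actual structures:

```python
import random

class Node:
    def __init__(self, name, seeds=()):
        self.name = name
        # membership view: node name -> version (higher = fresher information)
        self.view = {name: 0}
        self.peers = list(seeds)          # seed hosts known at startup

    def gossip_once(self):
        """Push our membership view to one random peer and pull theirs back."""
        if not self.peers:
            return
        peer = random.choice(self.peers)
        peer.merge(self.view)
        self.merge(peer.view)

    def merge(self, remote_view):
        for name, version in remote_view.items():
            if version > self.view.get(name, -1):
                self.view[name] = version  # adopt fresher membership info

seed = Node("seed-1")
newcomer = Node("node-2", seeds=[seed])
newcomer.gossip_once()   # seed learns about node-2, node-2 learns the seed's view
```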

Dynamo uses consistent hashing for partitioning. This allows adding new nodes to the hash space without re-hashing all existing values. Instead, only a portion of the values is moved to the new node, after which it becomes available for reads and writes.
Dynamo automatically replicates data to N hosts. When a client writes a value, a kind of master node is selected based on the key: the master is the node responsible for the range the key belongs to. The master node saves the data locally, but also replicates it to the next N-1 hosts on the consistent hash ring. For example, if node A is responsible for the hash range A0-AF and the key hashes to A9, then node A is picked as the master and asked to write the value. Node A stores the value locally, but also finds the nodes X and Y that are responsible for the ranges B0-BF and C0-CF respectively and calls them to save the value as well. If node A fails, Dynamo tries the next node on the hash ring, which is X, and reads the value from it. In that case, X also becomes the master of the A0-AF range.
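
A rough sketch of a consistent hash ring with a preference list of N nodes; the hashing and node names are illustrative, not Dynamo's actual scheme (which also places multiple virtual nodes per host):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Place a node name or key on the ring via a stable hash."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key: str):
        """The node owning `key` plus the next N-1 distinct nodes clockwise."""
        start = bisect.bisect_left(self.points, (ring_hash(key),)) % len(self.points)
        result = []
        for i in range(len(self.points)):
            node = self.points[(start + i) % len(self.points)][1]
            if node not in result:
                result.append(node)
            if len(result) == self.replicas:
                break
        return result

ring = Ring(["A", "X", "Y", "Z"], replicas=3)
print(ring.preference_list("user:42"))   # coordinator first, then its successors
```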

During a partition, or when different master nodes (nodes from the preference list) coordinate writes, it is possible to end up with different "latest" versions of the data on different nodes. Dynamo uses vector clocks to detect branching between versions and to find their common ancestor. A vector clock is basically a list of (node, counter) pairs. If neither of two objects' clocks is an ancestor of the other (i.e. its counters are not all less than or equal to the other's), the object versions have branched.
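
A small sketch of vector-clock comparison, with a clock modeled as a dict from node to counter (i.e. the list of (node, counter) pairs the paper describes):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a descendant of clock `b`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Neither clock descends from the other: the versions have branched."""
    return not descends(a, b) and not descends(b, a)

v1 = {"Sx": 2, "Sy": 1}   # version written through nodes Sx and Sy
v2 = {"Sx": 2, "Sz": 1}   # version written through nodes Sx and Sz
print(conflict(v1, v2))   # True -> the two versions must be reconciled
```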

There are 2 ways in Dynamo to merge divergent object versions: syntactic reconciliation and semantic reconciliation. Syntactic reconciliation can be done by Dynamo itself (when one version subsumes the other). With semantic reconciliation, all conflicting versions found are returned to the client, and the client has to do the reconciliation. Remark: this doesn't seem to be true for DynamoDB, which, I think, just has a better configuration of the R and W quorums.
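
For illustration, a client-side (semantic) merge in the spirit of the paper's shopping-cart example, where conflicting versions are reconciled by taking the union of their items (with the known downside that deleted items can resurface):

```python
def reconcile_carts(versions):
    """Merge conflicting cart versions by taking the union of their items."""
    merged = set()
    for cart in versions:
        merged |= cart
    return merged

print(reconcile_carts([{"book", "pen"}, {"book", "mug"}]))  # {'book', 'pen', 'mug'}
```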

Dynamo does not need a load balancer, because every node is capable of handling get and put calls. Every node knows which hosts are in the preference list for a given key range and can forward the call to them. This adds an extra hop, but it does not rely on a load balancer as a single point of failure (OK, maybe a load balancer isn't always a SPOF). The node from the preference list that handles the get/put call is called the coordinator node. The coordinator reads/writes the data locally, but also contacts a number (R or W) of other nodes in the preference list to reach a quorum on the data.
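
A hypothetical sketch of this forwarding and quorum read; the function and parameter names are illustrative, not Dynamo's API:

```python
def handle_get(node, key, preference_list, stores, r=2):
    """Any node accepts a get; non-preference nodes forward it (one extra hop),
    a preference-list node coordinates and queries R replicas for a quorum."""
    if node not in preference_list:
        return handle_get(preference_list[0], key, preference_list, stores, r)
    responses = [stores[peer].get(key) for peer in preference_list[:r]]
    return responses          # may contain divergent versions to reconcile

# Example: 'Z' is not in the preference list, so it forwards to coordinator 'A'.
stores = {"A": {"k": "v1"}, "B": {"k": "v1"}, "C": {"k": "v0"}}
print(handle_get("Z", "k", ["A", "B", "C"], stores, r=2))   # ['v1', 'v1']
```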

Merkle trees are used to detect inconsistencies between replicas. A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys and whose parents are hashes of their children. This reduces the amount of data passed between nodes and makes it quick to find where replicas diverge: if a parent node's hash differs, compare its children, and so on until the differing keys are found.
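
A toy Merkle-tree build and diff, assuming both replicas cover the same keys in the same order (not Dynamo's actual implementation):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleNode:
    def __init__(self, hash_, left=None, right=None, key=None):
        self.hash, self.left, self.right, self.key = hash_, left, right, key

def build(items):
    """items: sorted list of (key, value) pairs; returns the root of a toy tree."""
    nodes = [MerkleNode(h(k.encode() + v.encode()), key=k) for k, v in items]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            left = nodes[i]
            right = nodes[i + 1] if i + 1 < len(nodes) else None
            combined = left.hash + (right.hash if right else b"")
            paired.append(MerkleNode(h(combined), left, right))
        nodes = paired
    return nodes[0]

def diff(a, b):
    """Keys whose hashes differ, descending only into mismatched subtrees."""
    if a is None or b is None or a.hash == b.hash:
        return []
    if a.key is not None:                 # reached a differing leaf
        return [a.key]
    return diff(a.left, b.left) + diff(a.right, b.right)

t1 = build([("k1", "v1"), ("k2", "v2"), ("k3", "v3")])
t2 = build([("k1", "v1"), ("k2", "CHANGED"), ("k3", "v3")])
print(diff(t1, t2))   # ['k2'] -- only this key needs to be synchronized
```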

When a node fails to connect to another node, it marks that node as failed and may exchange this information with other nodes using the gossip-based protocol, so more and more nodes learn about the failure.

Each Dynamo node has several components: a request coordinator, membership and failure detection, and a local persistence engine with pluggable storage, e.g. MySQL or BerkeleyDB.

State machines are actively used to handle different operations, including requests from consumers. For example, a read operation creates a new state machine that does all the work: finding the nodes for the key, sending requests, receiving and handling responses, detecting version mismatches and sending the response to the consumer.
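
A skeletal sketch of such a read state machine; the real work is replaced by comments, and the state names are paraphrased from the paper's description rather than taken from actual code:

```python
from enum import Enum, auto

class ReadState(Enum):
    SEND_REQUESTS = auto()
    WAIT_FOR_QUORUM = auto()
    RECONCILE = auto()
    RESPOND = auto()
    READ_REPAIR = auto()
    DONE = auto()

def run_read(key, preference_list, r):
    state, responses = ReadState.SEND_REQUESTS, []
    while state is not ReadState.DONE:
        if state is ReadState.SEND_REQUESTS:
            # fan the request out to the nodes in the key's preference list
            state = ReadState.WAIT_FOR_QUORUM
        elif state is ReadState.WAIT_FOR_QUORUM:
            # collect replies until R nodes have answered (or a timeout fires)
            state = ReadState.RECONCILE
        elif state is ReadState.RECONCILE:
            # detect version mismatches via vector clocks
            state = ReadState.RESPOND
        elif state is ReadState.RESPOND:
            # return the value(s) to the consumer
            state = ReadState.READ_REPAIR
        elif state is ReadState.READ_REPAIR:
            # push the latest version back to any stale replicas
            state = ReadState.DONE
    return responses
```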

Dynamo also supports read repair, where data is repaired on read (as opposed to being resolved only on write). In this case the read operation's coordinator tries to repair stale data on the hosts that returned outdated versions.
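
A tiny illustration of the read-repair step, with assumed data shapes (each replica's response is a (value, clock) pair):

```python
def read_repair(key, responses, winner, stores):
    """responses: node -> (value, clock); winner: (value, clock) chosen by the read."""
    value, clock = winner
    for node, (_, stale_clock) in responses.items():
        if stale_clock != clock:
            stores[node][key] = (value, clock)   # repair the stale replica

stores = {"A": {"k": ("v2", 2)}, "B": {"k": ("v1", 1)}}
read_repair("k", {"A": ("v2", 2), "B": ("v1", 1)}, ("v2", 2), stores)
print(stores["B"]["k"])   # ('v2', 2) -- B was repaired on read
```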

The common configuration for Dynamo instances is N = 3 replicas with read and write quorums of 2 (R = W = 2). This satisfies the requirement R + W > N.

Dynamo tried 3 different strategies for distributing data between hosts:
  1. Strategy 1: T random tokens per node and partition by token value.
    This was the initial implementation, where each node received a list of randomly chosen tokens, so the ranges may vary in size.
  2. Strategy 2: T random tokens per node and equal-sized partitions.
  3. Strategy 3: Q/S tokens per node, equal-sized partitions.
In the end, the last strategy was chosen, because it achieves better efficiency and reduces the size of the membership information that needs to be stored on each node. It also gives faster startup and recovery, and eases archival.
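
A rough sketch of strategy 3 under assumed names: the hash space is split into Q equal-sized partitions and each of the S nodes takes Q/S of them (round-robin assignment here is a simplification of Dynamo's actual token placement):

```python
def assign_partitions(q: int, nodes: list) -> dict:
    """Assign Q equal partitions across the given nodes, Q/S per node."""
    assert q % len(nodes) == 0, "Q should be divisible by the number of nodes S"
    return {partition: nodes[partition % len(nodes)] for partition in range(q)}

layout = assign_partitions(q=8, nodes=["A", "B", "C", "D"])
print(layout)   # each node owns exactly Q / S = 2 equal-sized partitions
```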

Dynamo also moved the coordination state machine to the client side, so there is no need for load balancing and the extra network hop. This also lets a client remember the last accessed node, so a read that follows a write from the same client sees the just-written data.

Each node performs different kinds of background tasks, for example data replication. A background task has lower priority than the foreground task of serving data, so the developers had to come up with an algorithm that makes progress on background tasks without blocking any get/put operations. An admission controller monitors the node state (resource accesses, disk usage, latencies and failures) and gates the background tasks accordingly.

In general, it looks like Dynamo came a long way from the initial idea to a working solution. Throughout the paper it is clear that a lot of important changes were made along the way. For example, the strategy for adding new hosts and distributing requests changed a lot. I also noticed that the system moved from being per-client configurable (number of replicas, read/write quorum sizes) to best-practice values, where the number of replicas is 3 and the read/write quorum size is 2.

I can assume that Dynamo is a predecessor of AWS DynamoDB. If so, even more changes were made after the paper was published. I bet a lot of problems had to be solved before Dynamo could become publicly available under the name DynamoDB for general clients. For example, I don't remember anything in the DynamoDB documentation about returning multiple conflicting rows for the client to resolve (i.e. semantic reconciliation), nor do clients have a configuration choice for read/write quorum sizes. The paper also doesn't mention anything about indexes, or capacity usage calculation and throttling, which are important parts of DynamoDB.

The paper shows how many different techniques and algorithms were combined to build a distributed storage: Merkle trees, consistent hashing, vector clocks, gossip protocols, etc.

One can point out that Dynamo is just another DHT, and not the most advanced one, since it comes with eventual consistency, unclear reconciliation, etc. But I think it solved the problems it faced very well. First of all, the paper mentions the need to always support write operations, i.e. to always be available for clients. This means the system keeps working during a network partition, i.e. it is partition tolerant. It was also designed to scale automatically, without the need to migrate data manually when new hosts are added. And data is continuously replicated over a set of hosts, which makes it durable.

Another interesting note is that the Dynamo team decided to go with existing data storage solutions like BerkeleyDB or MySQL, instead of developing something new.

The References section also contains a list of really interesting works.