Leader Election: Gallager-Humblet-Spira (GHS) algorithm
The algorithm for building the MST is very similar to Kruskal's algorithm, although it has some specifics for nodes distributed across an arbitrary network. For example, the nodes have to communicate with each other in order to detect connected components, merge them, etc. After the MST is built, one of the nodes is chosen to be the leader, and a notification is sent to all other nodes in the MST.
Basically, the goal of this algorithm is to identify the nodes in the network and then elect a leader among them.
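As a rough illustration of the idea (not the actual distributed GHS protocol, which builds the tree via message passing between merging fragments), here is a minimal centralized sketch: build the MST Kruskal-style with a union-find structure, then pick the node with the smallest UID as the leader. All names and values below are made up for the example.

```python
# Centralized sketch of the idea only: real GHS is fully distributed,
# with fragments merging over their cheapest outgoing edges via messages.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path compression
        x = parent[x]
    return x

def mst_and_leader(nodes, edges):
    """nodes: list of UIDs; edges: list of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}
    mst = []
    for w, u, v in sorted(edges):          # Kruskal: cheapest edges first
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:                       # different components -> merge them
            parent[ru] = rv
            mst.append((u, v, w))
    leader = min(nodes)                    # e.g. the smallest UID wins
    # In the distributed version the leader announcement is then
    # broadcast to every node over the MST edges.
    return mst, leader

tree, leader = mst_and_leader([3, 1, 7], [(5, 1, 3), (2, 3, 7), (9, 1, 7)])
print(tree, leader)                        # [(3, 7, 2), (1, 3, 5)] 1
```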
Leader Election: LeLann, Chang and Roberts algorithm (LCR)
The algorithm's complexity is better than the bully algorithm's on average: its worst case of O(N^2) messages happens only if the hosts are organized into the ring in increasing (or decreasing, depending on which extreme is used) order of their UIDs.
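A minimal round-based simulation of the ring idea (the node UIDs and the synchronous round structure are just for illustration): each node forwards the largest UID it has seen so far, and the node that receives its own UID back declares itself the leader.

```python
# Toy LCR simulation on a unidirectional ring: forward the largest UID seen;
# a node that gets its own UID back knows it is the maximum and wins.

def lcr_election(uids):
    n = len(uids)
    pending = list(uids)                   # message each node is about to send
    messages = 0
    while True:
        incoming = [pending[(i - 1) % n] for i in range(n)]   # pass clockwise
        messages += sum(1 for m in incoming if m is not None)
        for i, m in enumerate(incoming):
            if m is None:
                pending[i] = None
            elif m == uids[i]:             # own UID came back -> leader found
                return uids[i], messages
            elif m > uids[i]:
                pending[i] = m             # forward the larger UID
            else:
                pending[i] = None          # swallow the smaller UID

print(lcr_election([3, 7, 2, 9, 5]))       # (9, 11): node 9 becomes the leader
```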
Leader Election: Bully Algorithm
There are 3 types of messages:
- election message - sent to announce election
- answer message - response to the election message
- coordinator message - sent to announce the identity of the elected leader
- Some process P finds that the leader is not available (or the coordinator is down). P initiates a new leader election by broadcasting an election message.
- Each process replies with its own UID.
- If P receives an answer from a process with a higher UID, it waits a bit for that host to announce itself as the new leader; if no coordinator message arrives, P restarts the election. If no process with a higher UID answers, P promotes itself to leader and broadcasts a coordinator message.
- However, if P receives a coordinator message from a new leader whose UID is lower than P's own UID, it will initiate another election.
In the worst case this takes about N^2 + N messages, as at any moment any process can try to start an election, and thus send an election message, handle N answer messages and send a coordinator message. As an optimization, a process can send election messages only to the processes with larger UIDs, which gives about N^2 / 2 + N messages: better, but still not ideal.
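A minimal centralized simulation of the idea (real deployments exchange these messages over the network, and the UIDs below are just example values):

```python
# Toy bully election: the process that notices the leader is gone asks all
# higher-UID processes; if none of them is alive, it becomes the coordinator,
# otherwise the highest alive UID takes over and announces itself.

def bully_election(initiator, alive_uids):
    """initiator: UID that detected the failure; alive_uids: set of live UIDs."""
    answers = [u for u in alive_uids if u > initiator]   # higher UIDs that reply
    if not answers:
        return initiator             # nobody higher answered -> initiator leads
    # In the real protocol the highest-UID responder runs its own election
    # round and eventually broadcasts the coordinator message.
    return max(answers)

alive = {2, 5, 8, 11}                # e.g. UID 13 (the old leader) has crashed
print(bully_election(5, alive))      # 11 wins and becomes the new coordinator
```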
Leader Election in Distributed Systems
There are multiple leader election algorithms. However, for most tasks it's possible to find a solution that doesn't need a single leader at all:
- for task scheduling, the list of tasks could be partitioned and, with locking, spread over multiple hosts
- for making a final decision, it's possible to reach consensus (e.g. with the Paxos algorithm)
- distributed locking could help other systems avoid having a leader (again, it might be based on a consensus algorithm like Paxos or Raft)
Amazon Dynamo paper notes
Dynamo is designed to be an eventually consistent storage. At the same time it allows clients to specify the number of hosts in the read (R) and write (W) quorums. The number of replicas N is also configurable. A good rule of thumb is to keep N >= 3 and R + W > N; in this case the read and write quorums overlap, so a read always touches at least one replica with the latest data, and only correct data would be returned.
Many systems use synchronous data replication to keep consistency strong (e.g. MySQL). This causes issues with availability, because if the replica fails, it becomes unavailable to its clients. Only a single master host can accept writes in such systems. In contrast to this, Dynamo uses asynchronous data replication algorithms. This helps to build partition-tolerant systems, as Dynamo accepts reads and writes even during a network partition, and resolves the update conflicts using different algorithms.
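A tiny sketch of the quorum overlap reasoning (the parameter names follow the paper, the concrete values are just examples):

```python
from itertools import combinations

# Exhaustively check that when R + W > N every possible read quorum
# intersects every possible write quorum, so a read always reaches at
# least one replica that saw the latest write.

def always_overlap(n, r, w):
    nodes = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(nodes, r)
               for writes in combinations(nodes, w))

print(always_overlap(3, 2, 2))   # True:  2 + 2 > 3
print(always_overlap(3, 1, 1))   # False: a read may miss the only written replica
```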
Dynamo is a P2P system that uses a gossip-based protocol to share membership and failure detection information with peers. The membership information contains the list of key ranges that the node is responsible for. Nodes also use the gossip protocol to announce themselves to other hosts: when a new Dynamo host starts, it already knows a list of seed hosts, which it contacts to tell them about its own availability. Those seed hosts help to deliver this information to other peers via gossiping.
Dynamo uses consistent hashing for partitioning. This allows adding new nodes to the hash space without a need to re-hash existing values. Instead, only some portion of the values is moved to the new node, after which it becomes available for reads and writes.
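A minimal consistent hashing sketch (md5 and the class below are just for illustration; Dynamo's actual implementation also uses virtual nodes and replication):

```python
import bisect
import hashlib

# Minimal consistent hash ring: nodes and keys are hashed onto the same
# circle, and a key belongs to the first node clockwise from its position.

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node):
        # Only the keys between the new node and its predecessor move to it.
        bisect.insort(self.ring, (_hash(node), node))

    def node_for(self, key):
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
ring.add("node-d")                   # most keys keep their previous owner
print(ring.node_for("user:42"))
```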
Merkle trees
A Merkle tree is a hash tree where the leaves are hashes of data blocks, each intermediate node is a hash of its children, and the root node is therefore a hash covering the whole tree. So to know whether two hash trees differ, it's enough to compare their root nodes. It's also easy to find the actual difference between two trees: just traverse them top-to-bottom and find the nodes with different hashes. This is how a Merkle tree helps to reduce the amount of data passed between two sources.
Merkle trees are used as an anti-entropy mechanism that helps to detect divergence between data stored in multiple locations. For example, Dynamo and Cassandra use Merkle trees to find cases when replicas have different versions of data. Merkle trees are also an awesome way to detect changes before syncing up data, which is one of the primary tasks for Dropbox, Chrome Backup, Google Drive, BitTorrent Sync, etc.
Merkle trees do not rely on timestamps, but on actual data differences. This makes hash trees a wonderful structure for synchronizing data between a source and a target with minimal effort spent on detecting the actual changes.
For example, assume that there is a backend B and N clients, where each client maintains a copy of the data (like Dropbox, BitTorrent Sync, Google Drive, etc.). One of the clients changes the data locally and uploads it to the backend B. Then every client can determine the actual change and exchange only the necessary data with the backend.
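A minimal sketch of building a Merkle tree over fixed blocks and diffing two copies (the block layout and hash choice are just for illustration):

```python
import hashlib

# Build a Merkle tree over fixed-size blocks and diff two trees: compare
# the roots, then descend only into subtrees whose hashes differ.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(blocks):
    """Returns the list of levels, from the leaf hashes up to the root."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(levels_a, levels_b):
    """Leaf indices that differ, found by descending only into changed subtrees."""
    pending = [(len(levels_a) - 1, 0)]     # start from the root
    changed = []
    while pending:
        lvl, i = pending.pop()
        if levels_a[lvl][i] == levels_b[lvl][i]:
            continue                       # identical subtree, skip it entirely
        if lvl == 0:
            changed.append(i)              # this leaf block must be synced
        else:
            for child in (2 * i, 2 * i + 1):
                if child < len(levels_a[lvl - 1]):
                    pending.append((lvl - 1, child))
    return sorted(changed)

a = build([b"block0", b"block1", b"block2", b"block3"])
b = build([b"block0", b"CHANGED", b"block2", b"block3"])
print(diff(a, b))                          # [1] -> only block 1 is transferred
```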
Hash trees are used by Git and Bitcoin. As I mentioned before, Dynamo, Cassandra and Riak use Merkle trees for anti-entropy as well. Merkle trees are also used in cryptography for message signatures (search for Lamport signature).
Links:
- https://en.wikipedia.org/wiki/Merkle_tree
- http://distributeddatastore.blogspot.com/2013/07/cassandra-using-merkle-trees-to-detect.html
- pdf.aminer.org/000/309/152/optimal_trade_off_for_merkle_tree_traversal.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.8626&rep=rep1&type=pdf
- http://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees
Onion Routing
Before, I couldn't quite understand how anonymity could be achieved in modern networks. However, onion routing is actually quite a simple but powerful way to do that. All genius is simple!
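A minimal sketch of the layering idea, assuming the third-party cryptography package for symmetric encryption (real onion routing such as Tor negotiates per-hop keys over circuits, which is omitted here): the sender wraps the message in one encryption layer per relay, and each relay can peel off only its own layer and learn only the next hop.

```python
from cryptography.fernet import Fernet

# Toy onion routing: wrap the payload in one encryption layer per relay.
# Each relay decrypts only its own layer, leaving the inner layers encrypted.

relays = ["relay-1", "relay-2", "relay-3"]     # path chosen by the sender
keys = {r: Fernet(Fernet.generate_key()) for r in relays}

def wrap(message: bytes, path):
    onion = message
    for relay in reversed(path):               # innermost layer = last hop
        onion = keys[relay].encrypt(onion)
    return onion

def route(onion: bytes, path):
    for relay in path:                         # each relay peels one layer
        onion = keys[relay].decrypt(onion)
    return onion

onion = wrap(b"hello, hidden world", relays)
print(route(onion, relays))                    # b'hello, hidden world'
```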
Gossip Protocol
Gossip protocol is a style of inter-process communication in a network, inspired by the gossip that happens in human social groups. Basically, nodes (peers) exchange some information. This information may contain knowledge that the node gained by itself (e.g. some events), received from a user (e.g. search queries) or received from other nodes during conversations with them.
Peers connect to each other in a random manner to ensure that information is shared pretty quickly; for example, a peer might connect to another peer and exchange information something like 10 times per second.
It's also possible to organize hosts into groups, where each node shares information only with nodes within its group. This simplifies finding the next random node to communicate with, and limits the number of random useless connections between nodes. Some nodes are shared between multiple groups, to ensure that information is spread over the whole network. As an alternative, nodes could communicate only with their nearest neighbors. Another alternative is to define hierarchical levels, which minimizes the need to send a message to everyone; instead it goes only to some hosts on the specified level.
Unfortunately, the gossip protocol does not guarantee that knowledge will be shared between all nodes. Another issue is that the received information might not be correct or valid anymore (well, that's in the nature of gossip, right?). It is also said that gossip protocols maintain relaxed consistency.
Another good usage of gossip is to multicast information over a network without a known structure. For example, it's not always possible to multicast information using multicast IP, because hosts might be organized into different restricted networks.
Gossip protocols are at the core of the internet.
When using gossip protocols it's better to share information about the fact than the fact itself, i.e. "configuration was changed" instead of "the new value of property A is B". This is also true in many distributed systems.
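A minimal push-gossip simulation (node count, fan-out and the round structure are arbitrary example values): in each round every node that already knows a rumor forwards it to a few random peers, and the rumor quickly covers the network with high probability.

```python
import random

# Toy push-gossip rounds: every informed node forwards the rumor to FANOUT
# random peers each round; count how many rounds full coverage takes.

NODES = 100
FANOUT = 3

def gossip_rounds(seed_node=0):
    informed = {seed_node}
    rounds = 0
    while len(informed) < NODES:
        newly = set()
        for _ in informed:                                    # each informed node
            newly.update(random.sample(range(NODES), FANOUT)) # picks random peers
        informed |= newly
        rounds += 1
    return rounds

print(f"rumor reached all {NODES} nodes in {gossip_rounds()} rounds")
```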
Usages
There are multiple usages for gossip protocols; some of them are:
- propagate events between multiple hosts, so each node can react if needed
- route data using different paths (get the data, split it into packets, deliver to the target using different routes)
- aggregate values across different hosts (each node receives the data, checks whether it has seen it, and if not, aggregates it with its own data and sends it forward to the next hosts)
- search among large amount of nodes (used in unstructured P2P networks)
- finding the list of nodes or just counting nodes in the network of unknown size and structure
- electing leaders
- sorting nodes in the network, detecting failures and errors, recovering the network structure
- used for monitoring and gathering data, for example for load balancing
- can help with detecting network partitions and dynamic insertions of new services
- heartbeat implementation
- implementing epidemic algorithms in the network
- gather information about other nodes to make local decisions, like load balancing, accessing data located on specific nodes, etc., in a scalable way
- detecting new nodes and adding them to the cluster or load balancer, updating DNS records, etc.
But usually gossip protocols are not the most efficient ones. Quite often it's possible to find a specific protocol that does the job better. However, some other protocols may not work for large networks (e.g. Paxos), while others are centralized and thus might have single-point-of-failure issues (e.g. solutions based on a centralized DB).
Other downsides of gossip protocols are:
- the same information can be delivered to the same host twice
- message latency is usually high, especially for large networks