Showing posts with label Scalable. Show all posts
Showing posts with label Scalable. Show all posts

Identify hot data to cache using Multiple Bloom Filters

Recently was thinking how to support caching of hot data in highly loaded application with huge throughput. Here I'm describing a note about the idea, based on bloom filters, that I've ended up. Far from the ideal, and I haven't tested it yet. But at this moment, it seems optimal in terms of CPU and memory usage. Also, after some research over the internet, I found that idea is not new though (see link at the end of post.)

Bloom Filter (BF) gives a nice way to check if some value have been seen before. Of course, this is a probabilistic data structure, so result always goes with with some level of error.

Using multiple hash functions within single BF allows it to be better at avoiding conflicts, but in this case, it also requires the size of bloom filter to be larger. As an opposite to having one large BF and using multiple hash functions, it is possible to using multiple BFs, each relying only on single hash function, both have different size for each BF.

Multiple BFs could be also used to figure out if specified value has been seen for some number of times. As an example, imagine there is an application that accepts reads of data from storage. There is a need to store some data in the LRU cache, but because the number of different values is so large, it is impossible to just add every value to the cache. Cache limit + LRU policy will make sure that even hot data could be easily removed from cache. For example, assume we have a cache with N elements, and we reading from storage M elements, such that M is much much larger than N. If we would start adding every read element to the cache, we easily fill it with one-time read values, and the hot data would be evicted.

Amazon Dynamo paper notes

Notes based on the paper about Amazon Dynamo. There is however a difference between Dynamo and AWS DynamoDB (at least based on name), however this is not discussed in the document.

Dynamo is designed to be an eventual consistent storage. At the same moment it allows clients to specify number of hosts in the read (R) and write (W) quorums. The number of hosts in the cluster N is also configurable. The good rule of thumb is to keep N >= 3, and R + W > N. In this case, only the correct data would be returned.

Many systems use synchronous data replication to keep consistency strong (e.g. MySQL). This causes issues with availability, b/c if the replica failed, it becomes unavailable to its clients. Only single master host could accept writes in such systems. In opossite to this, Dynamo uses asynchronous data replication algorithms. This helps to build partition tolerated systems, as Dynamo would accept reads and writes even during network partition, and resolves the update conflicts using different algorithms.

Dynamo is a P2P system that uses gossip based protocol to share the membership and failure detection information with peers. The membership information contains the list of key ranges that the node is responsible for. Also nodes use gossip protocol to discover itself to other hosts. Basically, when a new dynamo host starts, it knows already a list of seed hosts, which it connects to tell them about own availability. Those seed hosts help to deliver the information via gossiping to other peers.

Dynamo uses consistent hashing for partitioning. This allows adding new nodes to the hash space without a need to re-hash existing values. Instead, only some part of values is moved to the new node, after what it becomes available for reads and writes.

"A Story of Caching"

I was playing with memcached a few days ago. Never used it before, so spent some time to read documentation and FAQ.

While reading FAQ, I found the link to the page with A Story of Caching, and decided to share link to this story here.

IMHO, this is the best non-technical technical documentation, that I've read recently. In this story, they use simple but widespread use cases of developing web application and using cache in this web application.

If you don't know what's for memcached, than read this story.
If you don't know what's the primary features of memcached, than read this story.
If you don't know if you need memcached, than read this story.