Identify hot data to cache using Multiple Bloom Filters

Recently I was thinking about how to support caching of hot data in a highly loaded application with huge throughput. Here I'm describing the idea I ended up with, based on Bloom filters. It's far from ideal, and I haven't tested it yet, but at this moment it seems optimal in terms of CPU and memory usage. Also, after some research on the internet, I found that the idea is not new (see the link at the end of the post).

A Bloom filter (BF) gives a nice way to check whether some value has been seen before. Of course, it is a probabilistic data structure, so the result always comes with some level of error: false positives are possible, false negatives are not.
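Just to make the later sketches concrete, here is a rough Python sketch of a basic BF (illustrative only, not tested; the class name and parameters are mine):

```python
import hashlib


class SimpleBloomFilter:
    """A toy BF: one bit array, several seeded hash functions."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # May return True for a key that was never added (false positive),
        # but never returns False for a key that was added.
        return all(self.bits[pos] for pos in self._positions(key))
```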

Using multiple hash functions within a single BF makes it better at avoiding conflicts, but it also requires the Bloom filter to be larger. As an alternative to one large BF with multiple hash functions, it is possible to use multiple BFs, each relying on only a single hash function and possibly each having a different size.

Multiple BFs can also be used to figure out whether a specified value has been seen some number of times. As an example, imagine an application that serves reads of data from storage. There is a need to keep some data in an LRU cache, but the number of distinct values is so large that it is impossible to just add every value to the cache: the cache size limit plus the LRU policy mean that even hot data could easily be removed from the cache. For example, assume we have a cache of N elements, and we read M elements from storage, where M is much, much larger than N. If we added every read element to the cache, we would quickly fill it with one-time read values, and the hot data would be evicted.
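A quick toy illustration of this effect (hypothetical numbers, not a benchmark): three hot keys are pushed out of a small LRU cache by a long stream of one-time keys:

```python
from collections import OrderedDict


class LRUCache:
    """A minimal LRU cache on top of OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key


cache = LRUCache(capacity=100)                 # N = 100
for key in ("hot-1", "hot-2", "hot-3"):
    cache.put(key, "...")
for i in range(1_000_000):                     # M = 1,000,000 one-time reads
    cache.put(f"cold-{i}", "...")
print(any(k.startswith("hot-") for k in cache.data))  # False: hot keys evicted
```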

In this case, there is a need for some kind of filter on the application side, so that only frequently read data gets cached. For a service with a high throughput of very diverse data, it is hard to implement this filter with acceptable memory usage. The first guess, keeping a hash map of keys to counters and resetting the map from time to time, may seem like a reasonable solution, but if the load is thousands of requests per second, the memory would be trashed pretty fast.
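The naive counter map might look roughly like this (just a sketch, to show what we are trying to replace):

```python
import time
from collections import Counter


class NaiveHotKeyFilter:
    """Counts reads per key in a plain map and resets it every T seconds."""

    def __init__(self, threshold, reset_period_sec):
        self.threshold = threshold
        self.reset_period = reset_period_sec
        self.counts = Counter()
        self.last_reset = time.monotonic()

    def record_and_check(self, key):
        now = time.monotonic()
        if now - self.last_reset > self.reset_period:
            self.counts.clear()
            self.last_reset = now
        self.counts[key] += 1
        return self.counts[key] >= self.threshold  # True: worth caching
```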

Using a BF is an alternative to the map, and it allows compact memory usage. However, a single BF gets trashed too fast, and it does not answer the question of whether some key was used often enough to be cached.

A BF of larger size could solve the problem of being trashed too fast, and resetting the BF from time to time could also help here. As an alternative, multiple BFs could be used, as described above.

Using counters instead of bits in the BF (a counting Bloom filter) could help answer the question of how many times a key was seen. But it won't help much, because every conflict increases the counter value. Using multiple BFs could help here as well.
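For comparison, a counting BF sketch; the conflict problem is visible right in `estimate`: every key that hashes to the same slot bumps the same counter, so the value can only overestimate:

```python
import hashlib


class CountingBloomFilter:
    """A toy counting BF with a single hash function."""

    def __init__(self, size, seed=0):
        self.size = size
        self.seed = seed
        self.counters = [0] * size

    def _position(self, key):
        digest = hashlib.sha256(f"{self.seed}:{key}".encode()).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, key):
        self.counters[self._position(key)] += 1

    def estimate(self, key):
        # Never underestimates, but every conflict inflates the value.
        return self.counters[self._position(key)]
```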

Assume that for every key we move over multiple BFs in a round-robin way. To answer whether some key was actually used K times over some period of time, we use K Bloom filters. To support the time aspect, we reset one of the Bloom filters after some period T has passed. As we have a list of BFs, we reset only the tail Bloom filter at any moment and move it to the beginning of the list, so the filters rotate in a circular way.
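Here is a sketch of that scheme (untested, the names are mine): K single-hash BFs with different seeds; a key counts as "seen K times" once it is present in all of them, and every period T the tail filter is cleared and rotated to the front:

```python
import hashlib


class SingleHashBF:
    """A BF with exactly one (seeded) hash function."""

    def __init__(self, size, seed):
        self.size = size
        self.seed = seed
        self.bits = bytearray(size)

    def _pos(self, key):
        digest = hashlib.sha256(f"{self.seed}:{key}".encode()).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, key):
        self.bits[self._pos(key)] = 1

    def might_contain(self, key):
        return bool(self.bits[self._pos(key)])

    def clear(self):
        self.bits = bytearray(self.size)


class MultiBFCounter:
    def __init__(self, k, size):
        # A different seed per BF, so a conflict in one filter
        # does not repeat in the others.
        self.filters = [SingleHashBF(size, seed) for seed in range(k)]

    def record(self, key):
        """Register one sighting; return True once the key was seen ~K times."""
        for bf in self.filters:
            if not bf.might_contain(key):
                bf.add(key)
                return False
        return True  # present in all K filters: probably seen at least K times

    def rotate(self):
        """Call every period T: reset the tail BF and move it to the front."""
        tail = self.filters.pop()
        tail.clear()
        self.filters.insert(0, tail)
```

The caller would run `rotate()` on a timer and put a key into the cache only when `record(key)` returns True.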

There is always a possibility of conflicts, especially when we use the same hash function and the same size for each BF in the list. There are two alternatives: either use a different hash function for each BF (or the same hash function with a per-BF parameter, such as a seed, to avoid repeating conflicts), or use a different size for each BF.

Another problem is how to know which BF should be updated next for a given key. There is always the option to start from the first BF in the list and keep checking until we find one that does not have the key yet. For large K this could take some time, given that different hash functions or BF sizes might be used. An alternative is to use binary search to find the first BF without the key. For every key, check the first BF: if the key is not there (for example, because the filter was just reset after the period T), we might want to start from the beginning. If the first BF "knows" about the key, check the last or the middle BF, and keep going in binary-search style until we find the latest BF that has the key. Then just update the next BF in the list.
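A sketch of the binary-search variant, reusing `MultiBFCounter` from above (it relies on the assumption that a key occupies a prefix of the filter list, which a recent reset can temporarily break):

```python
def record_with_binary_search(counter, key):
    """Like MultiBFCounter.record, but finds the BF to update by binary search."""
    filters = counter.filters
    if not filters[0].might_contain(key):
        filters[0].add(key)            # period T may have elapsed: start over
        return False
    lo, hi = 0, len(filters) - 1       # filters[lo] is known to contain the key
    while lo < hi:                     # find the last filter containing the key
        mid = (lo + hi + 1) // 2
        if filters[mid].might_contain(key):
            lo = mid
        else:
            hi = mid - 1
    if lo == len(filters) - 1:
        return True                    # already present in all K filters
    filters[lo + 1].add(key)
    return False
```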

One of the challenges in this algorithm is dealing with the probabilistic nature of Bloom filters; however, the more BFs we have, the lower the error rate should be. Another challenge is supporting concurrent updates for the same key, as in that case there is always a risk of missing some update.

Being able to calculate the size of the BF, the number of BFs K, and the time-to-reset T is another problem. It might require not only an analytical approach but also some way of tuning the behavior. Computing those values becomes even more complicated once we assume that multiple nodes are running this code: if the requirement is to cache the value for a key after it has been seen M times, we have to take into account that this functionality might be distributed over some number of hosts H.
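As a very rough starting point for sizing (my own back-of-the-envelope estimate, not taken from the paper below): for a single-hash BF with m bits and n distinct keys inserted during one period T, the false positive rate is about 1 - e^(-n/m), which can be inverted to pick m for a target error:

```python
import math


def false_positive_rate(m_bits, n_keys):
    # Approximate FP rate of a single-hash BF after n_keys insertions.
    return 1 - math.exp(-n_keys / m_bits)


def bits_needed(n_keys, target_fp_rate):
    # Inverse of the formula above: m = -n / ln(1 - p).
    return math.ceil(-n_keys / math.log(1 - target_fp_rate))


print(false_positive_rate(m_bits=1_000_000, n_keys=100_000))  # ~0.095
print(bits_needed(n_keys=100_000, target_fp_rate=0.01))       # ~9.95 million bits
```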

I also recently found a paper about this: https://www.usenix.org/legacy/event/fast11/posters_files/Park_D.pdf.