
Simple Multiple BloomFilters data structure implementation in Java

As a follow-up to my previous post about using the Multiple Bloom Filters data structure to identify hot values, I decided to write a simple, naive implementation in Java and open source it.

The project can be found on GitHub and is proudly named multi-bloom-filter.

It comes with a basic little class called MultiBloomFilter, which does most of the job. It accepts the number of enclosed bloom filters, the capacity of each BF, and the duration before the head BF is reset. One can also specify which hash function is used and how many times the hash function should be applied to each value.

Simple example:
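(The constructor arguments and method names below are illustrative guesses, not the exact API; see the project page for the real one.)

import java.time.Duration;

// 3 internal BFs, 10000 elements each, the head BF is reset every minute
MultiBloomFilter mbf = new MultiBloomFilter(3, 10000, Duration.ofMinutes(1));

mbf.put("hot-key");
mbf.put("hot-key");
mbf.put("hot-key");
System.out.println(mbf.contains("hot-key")); // true: the key is in all internal BFs

mbf.resetHead();                             // clears only the head BF
mbf.put("hot-key");                          // one more occurrence...
System.out.println(mbf.contains("hot-key")); // ...and the key is hot again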

This short example shows that the MBF resets only one of the internal BFs. That means each reset removes only part of the data, and when a hot key is added again, it is still identified as such.

Once again, MBF is a great solution if you need to find a set of hot values for some period of time. In particular, this helps to put only hot values into the cache. If many hosts use a single distributed cache service, then using an MBF can save the redundant traffic of putting cold data into the cache, where it would be evicted pretty fast anyway. Also, since the hot keys are in the MBF, there is a high chance they are in the distributed cache as well. Thus the application has a kind of "bloom filter" to check the chance that a value can be found in the cache for a specified key.

There are many more use cases for the MBF data structure. Being able to work in a concurrent, scalable environment is another "feature" that I love about Bloom filters, and about MultiBloomFilter in particular. To me, a good BloomFilter implementation that is able to grow and scale correctly, and that has mechanisms to evict data and fight false positives, sounds like a very useful service.

Reverse Hash Tree

Reverse hash tree (RHT) is a tree data structure where nodes contain a hash (i.e., not only a value). I call this tree "reverse" because the way a node's hash is computed is the reverse of the Merkle tree approach. In a Merkle tree, parent nodes contain a hash of their child nodes. This gives a quick way to find the difference between two trees: differing parent nodes simply have different hashes, and the next step is to find the child node that actually differs.

In an RHT, a parent node's hash is not built from the child nodes' hashes; it is the opposite: a child node's hash is built from its own hash/value and the parent node's hash. For leaf nodes, the value could be something useful; for intermediate nodes, it could be just some version or category, etc.
 
When a child node is accessed, its hash is validated against its current value/hash and the parent node's hash. If the value is valid, we keep validating the parent node's hash against its own parent, and so on. In practice, this can be done on the way from the root node down to the child node. If any node has an inconsistent hash, that node and its children require a hash/data recalculation.
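To make the idea concrete, here is a small sketch (SHA-256 and the class layout are my own choices for illustration, not a canonical implementation):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class RhtNode {
    final RhtNode parent;   // null for the root
    final byte[] value;     // payload for leaves; version/category for inner nodes
    byte[] hash;

    RhtNode(RhtNode parent, byte[] value) {
        this.parent = parent;
        this.value = value;
        this.hash = computeHash();
    }

    // child hash = H(parent hash || own value); the root hashes only its value
    byte[] computeHash() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            if (parent != null) {
                md.update(parent.hash);
            }
            md.update(value);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // walk from this node up to the root, re-deriving each hash;
    // the first mismatch marks the subtree that needs recalculation
    boolean isConsistent() {
        for (RhtNode n = this; n != null; n = n.parent) {
            if (!Arrays.equals(n.hash, n.computeHash())) {
                return false;
            }
        }
        return true;
    }
}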

Identify hot data to cache using Multiple Bloom Filters

Recently I was thinking about how to support caching of hot data in a highly loaded application with huge throughput. Here is a note about the idea, based on Bloom filters, that I ended up with. It is far from ideal, and I haven't tested it yet, but at the moment it seems optimal in terms of CPU and memory usage. Also, after some research on the internet, I found that the idea is not new (see the link at the end of the post).

A Bloom Filter (BF) gives a nice way to check whether some value has been seen before. Of course, it is a probabilistic data structure, so the result always comes with some level of error.

Using multiple hash functions within a single BF makes it better at avoiding collisions, but it also requires the Bloom filter to be larger. As an alternative to having one large BF with multiple hash functions, it is possible to use multiple BFs, each relying on only a single hash function, possibly with a different size for each BF.

Multiple BFs can also be used to figure out whether a specified value has been seen some number of times, as the sketch below shows. As an example, imagine there is an application that serves reads of data from storage. There is a need to keep some data in an LRU cache, but the number of distinct values is so large that it is impossible to just add every value to the cache: the cache size limit plus the LRU policy mean that even hot data could easily be removed. For example, assume we have a cache of N elements and we read M elements from storage, where M is much, much larger than N. If we started adding every read element to the cache, we would easily fill it with one-time-read values, and the hot data would be evicted.
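A sketch of this counting trick (my own illustration, not a finished design): a value is recorded in the first BF that doesn't contain it yet, and it counts as hot only once every BF has seen it.

import java.util.BitSet;

class HotValueDetector {
    private final BitSet[] filters;
    private final int bits;

    HotValueDetector(int numFilters, int bitsPerFilter) {
        this.bits = bitsPerFilter;
        this.filters = new BitSet[numFilters];
        for (int i = 0; i < numFilters; i++) {
            filters[i] = new BitSet(bitsPerFilter);
        }
    }

    // one cheap seeded hash per filter; a real implementation would
    // use a proper hash family (e.g. murmur3 with different seeds)
    private int index(Object value, int seed) {
        int h = value.hashCode() ^ (seed * 0x9E3779B9);
        return Math.floorMod(h, bits);
    }

    // record one occurrence; returns true only when the value was already
    // present in every filter, i.e. it has been seen at least
    // filters.length times before (modulo false positives)
    boolean addAndCheckHot(Object value) {
        for (int i = 0; i < filters.length; i++) {
            int idx = index(value, i);
            if (!filters[i].get(idx)) {
                filters[i].set(idx);
                return false;
            }
        }
        return true;
    }
}

With a cache of N slots and M distinct values read, only the values for which addAndCheckHot() returns true go into the cache, so one-time reads never displace the hot set.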

tmpfs: work with your data even faster

Currently I am improving IDEA performance by copying its cache files into RAM using tmpfs. Actually, tmpfs and ramfs are good ideas.

As described in Wikipedia, tmpfs is intended to appear as a mounted file system, but is stored in volatile memory instead of on a persistent storage device. Simply put, when you copy a file to such an FS, you copy it to RAM; when you create a file in this FS, you create it in RAM; when you delete a file in this FS, you delete it from RAM. And so on.

The negative side of tmpfs is that it is not backed by any storage: on restart or system crash you lose all files and data stored in tmpfs. But on system start you can copy the files from disk into memory again and continue to work.

In the case of the IDEA cache, I will need to write a simple script that periodically copies the cache from tmpfs to disk. Then, on restart, I can simply restore the cache and don't have to wait while re-caching is done.
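Something as simple as a cron record could do the trick (the paths below are placeholders, not my actual setup):
 # crontab -e
 */10 * * * * cp -R /path/to/tmpfs/idea-cache /path/to/disk/idea-cache-backup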

So, how do you create tmpfs storage? It's pretty easy to do with the next commands. Here I create an empty directory tmp and mount it as a tmpfs filesystem.
# mkdir ~/tmp
# mount -t tmpfs -o size=500m tmpfs ~/tmp
The option size=500m limits memory usage to 500 MB. tmpfs also supports flushing content to swap when needed. This is one of the main differences between tmpfs and ramfs: ramfs is not limited and can grow dynamically, and its used memory can't be freed or flushed to swap. The negative side of the dynamic nature of ramfs is that the system can hang when no free memory is left.

Reading content from and writing it to RAM is much, much faster than doing the same operations with a file on disk. It's a pretty good optimization if you need to serve read-only content or content that can easily be restored when needed.

One example of such content is a hosted website. You can decrease page or resource loading time by moving them from the hard disk to the memory filesystem.
 # mkdir /var/www/www.example.com
 # mount -t tmpfs -o size=50M tmpfs /var/www/www.example.com
 # cp -R /home/www/www.example.com/* /var/www/www.example.com

To mount the tmpfs automatically on system load, you will need to add another record to your /etc/fstab configuration file.
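A record like this one should do it (the mount point and size just mirror the example above):
 tmpfs /var/www/www.example.com tmpfs size=50m 0 0
After that, the only thing you need to do on each system start is to copy the content into it again: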
 # cp -R /home/www/www.example.com/* /var/www/www.example.com

To summarize, tmpfs is a good way to work with a large amount of files or data that needs to be accessed or processed quickly. Such data is also either read-only or can easily be recreated. Examples of such data are static websites, website resources, temporary caches, etc. When you need to process a large amount of data, you can also split it into small pieces and process them one by one. tmpfs also takes a limited amount of RAM and cannot grow over that amount.

"A Story of Caching"

I was playing with memcached a few days ago. I had never used it before, so I spent some time reading the documentation and FAQ.

While reading the FAQ, I found a link to the page with "A Story of Caching", and decided to share the link to this story here.

IMHO, this is the best non-technical technical documentation that I've read recently. The story goes through a simple but widespread use case: developing a web application and using a cache in it.

If you don't know what memcached is for, then read this story.
If you don't know what the primary features of memcached are, then read this story.
If you don't know whether you need memcached, then read this story.

Simple LRU cache implementation

At this URL you can find a post about using LinkedHashMap as an LRU cache. I've also found such code, with a direct link in a comment, in the Apache Roller source code.
LinkedHashMap can be used as an LRU cache because it supports ordering its elements not only by insertion order, but also by access order. There is also a protected method removeEldestEntry in LinkedHashMap, which is invoked from the put() and putAll() methods. If that method returns true, then the least recently inserted/accessed element will be removed.
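For completeness, a minimal version of that approach:

import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache on top of LinkedHashMap: accessOrder = true keeps entries
// ordered by access, and removeEldestEntry() evicts the least recently
// used entry once the capacity is exceeded
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // initial capacity, load factor, accessOrder
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

// Usage:
// LruCache<String, String> cache = new LruCache<>(2);
// cache.put("a", "1");
// cache.put("b", "2");
// cache.get("a");      // touch "a", so "b" is now the eldest
// cache.put("c", "3"); // evicts "b"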