Random(Notes): August 2014

Simple Multiple BloomFilters data structure implementation on Java

As a follow up on my previous post about using Multiple Bloom Filters data structure to identify hot values, I decided to write a simple dumb implementation on Java. And open source it.

The project can be found on GitHub and is proudly named multi-bloom-filter.

It comes with basic little class called MultiBloomFilter which does most of the job. It accepts the number of enclosed bloom filters, capacity of each BF and duration before the the head BF will be reset. One can also specify what hash function is used and how many times hash function should be applied for each value.

Simple example:

This short example shows that MBF will reset only one of the internal BFs. Means, whenever reset happens, it will remove only part of the data, and whenever the hot key is added again, it would be identified as such.

Once again, MBF is a great solution if you need to find a set of hot values for some period of time. In particular, this helps to put only hot values into the cache. If we have many hosts that use a single distributed cache service, then using MBF might save from redundant traffic of putting cold data into the cache, where they would be evicted pretty fast. Also, as hot keys are in MBF, means there is a high chance they are in the distributed cache as well. Thus application has some kind of "bloom filter" to check what is the chance that value could be found in the cache for specified key.

There are much more use cases for the MBF data structure. Being able to work in concurrent scalable environment is another "feature" that I love about BloomFilters and MultiBloomFilter in particular. For me, good implementation of BloomFilter, that is able to grow and scale correctly, has different mechanism to evict data and fight the false positives, sounds as a very useful service.

SLA

SLA (Service Layer Agreement) for large distributed software is very important. It plays the role of contract between the application and it's clients. It's also not very straightforward to achieve, especially if many components participate in the process. Each component in this case should have even more strict SLA, because the result of each component SLA would sum up. For example, if single call to some service A is resulted in multiple calls to other services, it's important that other services had better SLA in order to keep service A SLA promise.

There might be different types of SLA: resources, availability, durability and performance.
Many systems provides contract on how many resources are available for the client: memory, CPU, disk space etc.
Some websites and services say that their availability SLA is 99.9%, which means that 99.9% of time they going to be available. Actually, that is not very much at all. There is a nice table on Wikipedia with conversion between availability percentage and actual time.

Some services, especially the storage services like S3, have also durabitlity SLA. This to say how rarely the service might lost the data.

Performance SLA is common for running services that need to not only be available, but return response on request within specified period of time. For performance SLA, it is always common to use some percentile of requests that would be handled within SLA. For instance, it might be said that SLA is return response in 10 ms or less for 99.9% of requests.

Twister: A Simple Way to Manage Your Scripts

Imagine an average project that has many scripts, each written using different practices, uses different argument names, different namings, does something similar to other script but a bit different etc. Sometimes there are so many scripts, that it's hard to find the one you really need at this very moment. Moreover, scripts have no standard location, often put in semi-random directories, so it's really hard to find them. And even more, many developers have similar scripts for different projects. Some scripts are in the PATH, others are relative to the project directory. The one in the PATH are named in odd manner, because different versions used for different projects. And some scripts are written using bash and ruby and python etc.

Oh, so many troubles just because of the scripts. That's the reason why Twister was created. Twister is a simple framework/tool that allows to create and manage project scripts. I use it in a few of my home projects, and find it very helpful. About a year ago, I open sourced Twister. It's a python project, so it makes it simple to create scripts and execute them on any popular operating system.

keep reading »

Reverse Hash Tree

Reverse hash tree (RHT) is a tree data structure, where nodes contain a hash (ie not value only). I call this tree a "reverse" because the way node's hash value is computed is reversed to the Merkle tree way. In Merkle tree, parent nodes contain a hash of the child nodes. This allows a quick way to find the difference between 2 parent nodes in 2 trees - they would just have different hashes. The next step would be to find the child node that is actually differs.

In RHT, not a parent node hash is build based on child nodes hash, but in opposite: child node hash is build based on its own hash/value and parent node hash. For leaf nodes, value could be something useful, for intermediate nodes it could be just some version or category etc.

When child node it accessed, its hash is validated based on current value/hash and parent node hash. If the value is valid, we keep validating parent node hash with its parent node hash etc. This actually could be done on the way from root node to the child node. If found any node with inconsistent hash, this means the node and its children require a hash/data recalculation.

keep reading »