Reverse Hash Tree

Reverse hash tree (RHT) is a tree data structure, where nodes contain a hash (ie not value only). I call this tree a "reverse" because the way node's hash value is computed is reversed to the Merkle tree way. In Merkle tree, parent nodes contain a hash of the child nodes. This allows a quick way to find the difference between 2 parent nodes in 2 trees - they would just have different hashes. The next step would be to find the child node that is actually differs.

In RHT, not a parent node hash is build based on child nodes hash, but in opposite: child node hash is build based on its own hash/value and parent node hash. For leaf nodes, value could be something useful, for intermediate nodes it could be just some version or category etc.
 
When child node it accessed, its hash is validated based on current value/hash and parent node hash. If the value is valid, we keep validating parent node hash with its parent node hash etc. This actually could be done on the way from root node to the child node. If found any node with inconsistent hash, this means the node and its children require a hash/data recalculation.

 Because of this hash key organization, it is possible to change a value/hash for some specific parent node, and it will require all child node to re-calculate its hash based on the parent node hash. This gives an easy way to invalidate large amount of data, without actually doing this for every specific record. Here invalidate doesn't mean the same as remove data. As nodes in the tree stay untouched, they just will need to recalculate its hash. Re-calculating hash may include additional responsibilities, like check if nodes value is still valid/fresh or may require re-computation on its own.

This makes reverse hash trees a valuable structure for caches. Instead of removing part of the tree from cache, it is just marked as invalidated. Old records are still accessible, so could be used when its required (for example, getting a fresh record takes too much time). Because old records stay in the tree, another process could be used in parallel that could do refreshing data and hash.

Keeping records in RHT, when parent node version changes, allows to save system from extra work. For example, assume that parent node is changed b/c the environment changes, and new child records might be added, some old records could be removed. Also, calculating the child node value is expensive operation, but checking if value should be removed is not. RHT will allow to find the child nodes that need to be re-evaluated (removed or not?), but after that will save a lot of time, as no need to calculate value for the present nodes. Only new nodes will require additional work.

No comments: