Could someone explain me, in a simple way, the d-left counting Bloom filter and in particular the use of fingerprints and remainders?
And is there a good Python implementation of it?
Could someone explain me, in a simple way, the d-left counting Bloom filter and in particular the use of fingerprints and remainders?
And is there a good Python implementation of it?
Intuitively speaking, a d-left counting Bloom filter (or dlcBF for short) is a variation on Bloom filters that's designed to support both insertions and deletions.
A regular Bloom filter lets you add in new items after the filter has been created. To do so, simply hash your new item x with each of the hash functions, go to the bits in question, and set them equal to 1. However, Bloom filters don't naturally support deletions after items have already been added. Specifically, you can't just flip each of the bits an item hashes to to 0, because other items might hash to the same positions.
The dlcBF is a (relatively) space-efficient way to provide the same functionality as a Bloom filter, but with the ability to support deletions as well as additions. The dlcBF uses roughly 1.5-2x as much space as a classical Bloom filter.
Since the dlcBF was invented in 2006, newer data structures that mimic Bloom filters with dynamic insertions and deletions have also been invented. The cuckoo filter, first described in 2014, supports the same set of operations but with a significantly lower memory overhead, assuming a reasonably low error rate. If you're just interested in a Bloom filter replacement that does inserts and deletes, I'd recommend opting for a cuckoo filter, for which there are many good implementations available.
As for how the dlcBF works, there are two basic principles governing its operation. The first idea is buckets and slots. The dlcBF works by maintaining an array of b buckets. The buckets are grouped into d groups of (roughly) b/d buckets each. Each bucket is in turn subdivided into s slots, where each slot consists of a two-bit counter and space to store a small number of bits.
The values stored in the slots within each bucket are derived using a technique called fingerprinting. Each item stored in the dlcBF is hashed to a number which we'll call a fingerprint. The key source in turn is used to derive the locations of d buckets and a small bit sequence called a remainder for each bucket. To see if an item is present in the dlcBF, you compute its key source, then derive d pairs of buckets and remainders. Next, check each of the indicated buckets to see if the particular remainder is stored in one of the slots in the bucket. If not, the item isn't present. If so, the item either is present, or we have a false positive (like in a regular Bloom filter).
The d-left counting bit comes up when adding an item. To add an item, compute its key source, then get the d bucket/remainder combinations. Find whichever of those buckets has the fewest remainders in it, breaking ties to the left (e.g. lower bucket indices), then either add the remainder if it's not in any of that bucket's slots, or increment the associated counter if the remainder is present. To delete an item, simply search for it as usual, and if you find a bucket/remainder sequence that matches your item, decrement its counter.
The analysis of the dlcBF boils down to showing that (1) this strategy of placing items in buckets with few items, tiebreaking to the left, tends to spread items out close to uniformly, then (2) working out the math about how many buckets you need, how long your fingerprints should be, and how many bits you need per item. Intuitively, making the fingerprint bigger reduces the likelihood of a hash collision between an item stored in the table and one not stored in the table, but increases the amount of space needed. Decreasing the fingerprint size reduces the size of the table, but also increases the false positive rate.