I understand the basic guarantees of a Bloom filter data structure.
Guarantees:
- A negative result is guaranteed to be correct (no false negatives).
- A positive result may be wrong, with some probability known as the 'false positive probability'.
Let's take the use case of checking whether a user has 'liked' an article. We can do this by checking whether the concatenation of user.user_id and article.article_id is in the Bloom filter.
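For concreteness, here is a minimal sketch of what I mean (the `BloomFilter` class and the `"user_id:article_id"` key scheme are just illustrative, not any particular library's API):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m: int, k: int):
        self.m = m  # number of bits
        self.k = k  # number of hash probes per key
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, key: str):
        # Derive k bit positions by hashing the key with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: str) -> bool:
        # False => definitely not present (the guarantee above);
        # True => present, up to the false-positive probability.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(key))


likes = BloomFilter(m=8192, k=4)
likes.add("user42:article7")  # key = user_id + ":" + article_id

print(likes.might_contain("user42:article7"))  # True (no false negatives)
print(likes.might_contain("user42:article9"))  # False, unless we hit a false positive
```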
The context of my question is: the vast majority of such (user_id, article_id) pairs will never be in the Bloom filter, i.e. the base probability that a uniformly random pair is in the filter is very small.
Using the Bloom filter, we increase the probability of correctly detecting a negative (a true negative) from near 1 to exactly 1. That is to say, most of the time we are computing that a random pair is NOT in the Bloom filter for naught, because most of the time it wouldn't have been there anyway.
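To make that concrete, here is the read path I am imagining, with a plain set standing in for the filter and a hypothetical `slow_lookup` standing in for the authoritative store:

```python
def check_like(might_contain, db_lookup, user_id: str, article_id: str) -> bool:
    """might_contain: Bloom-filter membership test (may return false positives).
    db_lookup: authoritative (slow) lookup, only reached on a positive."""
    key = f"{user_id}:{article_id}"
    if not might_contain(key):
        return False  # guaranteed true negative: the slow lookup is skipped
    # Possible false positive: confirm against the authoritative store.
    return db_lookup(key)


# Stand-in: a set behaves like a Bloom filter with zero false positives.
liked = {"user42:article7"}
calls = []


def slow_lookup(key: str) -> bool:
    calls.append(key)  # track how often we actually hit the "DB"
    return key in liked


print(check_like(liked.__contains__, slow_lookup, "user42", "article7"))  # True
print(check_like(liked.__contains__, slow_lookup, "user1", "article1"))   # False
print(calls)  # only the positive path touched the slow store
```

My point is that in this sketch, the negative branch is taken almost every time, which is exactly the case where we already "knew" the answer was almost certainly no.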
My questions are: where exactly is the performance gain in using a Bloom filter? And, more generally, what are the criteria in a particular circumstance that would warrant the use of a Bloom filter?