How to implement a fast fuzzy-search engine using BK-trees when the corpus has 10 billion unique DNA sequences?

Question

I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.

Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.

I was thinking of storing the corpus in a forest of a thousand BK-trees and querying them in parallel.

Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do in order to use BK-trees for this corpus?

I use pybktree.py and my queries are intended to find all entries within an edit distance d.

Is there some architecture or database which will allow me to store those trees?

Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).

Does this help: https://stackoverflow.com/questions/10052105/how-optimize-bk-tree — Paddy3118, Jan 06 '21 at 06:12
Do you run out of memory? If so, get more memory, as parallel instances would need even more memory on the same machine. This lib is a wrapper over a C implementation that might have different operating characteristics: https://github.com/TeamHG-Memex/py-bkstring (untried by me) — Paddy3118, Jan 06 '21 at 06:19
Wow, 10B entries - that's a lot! Even if it performed well, that's going to use an awful lot of memory (you may need a more memory-efficient way to store them than high-overhead Python objects). What is the value of *n* in your `find()` call? High *n* values can be slow, and query speed also depends on the "shape" of the values. Here's a detailed performance analysis: https://github.com/benhoyt/pybktree/issues/5 — Ben Hoyt, Jan 06 '21 at 06:48
@BenHoyt thanks for your GitHub module and feedback. Those are DNA sequences for which I would also like to allow insertions and deletions in the distance function, as implemented in `fuzzywuzzy`. The distance I usually look for is up to 6, and each entry has a fixed length of 20 DNA letters over the alphabet A, T, C, and G — 0x90, Jan 06 '21 at 06:54
@Paddy3118 not really, for two reasons: I have 10,000x more values, and my distance function supports insertions and deletions, based on `fuzzywuzzy`. Memory-wise, since these are short DNA sequences we can store them as 48bit integers, but even as strings they don't consume much memory. — 0x90, Jan 12 '21 at 08:23

maxbachmann · Accepted Answer · 2021-02-12T23:55:57.057

FuzzyWuzzy

Since you are mentioning your usage of FuzzyWuzzy as distance metric I will concentrate on efficient ways to implement the fuzz.ratio algorithm used by FuzzyWuzzy. FuzzyWuzzy provides the following two implementations for fuzz.ratio:

difflib, which is completely implemented in Python
python-Levenshtein which uses a weighted Levenshtein distance with the weight 2 for substitutions (substitutions are deletion + insertion). Python-Levenshtein is implemented in C and a lot faster than the pure Python implementation.

Implementation in python-Levenshtein

python-Levenshtein works as follows:

It removes the common prefix and suffix of the two strings, since they do not have any influence on the end result. This can be done in linear time, so matching similar strings is very fast.
The Levenshtein distance between the trimmed strings is implemented with quadratic runtime and linear memory usage.
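
The two steps above can be sketched in plain Python. This is only an illustration and not the actual C code of python-Levenshtein; the function names `strip_common_affix` and `weighted_levenshtein` are made up for this example.

```python
def strip_common_affix(s1, s2):
    """Drop the shared prefix and suffix; they never change the distance."""
    limit = min(len(s1), len(s2))
    start = 0
    while start < limit and s1[start] == s2[start]:
        start += 1
    end = 0
    # Do not let the suffix scan run into the prefix that was already removed.
    while end < limit - start and s1[-1 - end] == s2[-1 - end]:
        end += 1
    return s1[start:len(s1) - end], s2[start:len(s2) - end]


def weighted_levenshtein(s1, s2, sub_cost=2):
    """Quadratic-time, linear-memory DP; substitutions cost `sub_cost`."""
    s1, s2 = strip_common_affix(s1, s2)
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(
                prev[j] + 1,                                # deletion
                cur[j - 1] + 1,                             # insertion
                prev[j - 1] + (0 if c1 == c2 else sub_cost)  # match / substitution
            ))
        prev = cur  # only the previous row is kept
    return prev[-1]
```

For two 20-letter sequences that differ in a single position, the trimming step leaves one character on each side, so the quadratic part is trivial.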

RapidFuzz

I am the author of the library RapidFuzz which implements the algorithms used by FuzzyWuzzy in a more performant way. RapidFuzz uses the following interface for fuzz.ratio:

def ratio(s1, s2, processor = None, score_cutoff = 0)

The additional score_cutoff parameter can be used to provide a score threshold as a float between 0 and 100. For ratio < score_cutoff, 0 is returned instead. This allows the implementation to use a more optimized algorithm in some cases. Below I describe the optimizations used by RapidFuzz depending on the input parameters. Here, max distance refers to the maximum distance that is possible without getting a ratio below the score threshold.
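
The relation between score_cutoff and max distance can be made concrete with a small sketch. It assumes the ratio is normalized as 100 * (1 - distance / (len(s1) + len(s2))) with substitutions costing 2; that normalization is an assumption here and is not stated in the text above, and the helper names are hypothetical.

```python
import math


def indel_ratio(distance, len1, len2):
    """Assumed normalization: 100 means identical, 0 means nothing in common."""
    total = len1 + len2
    if total == 0:
        return 100.0
    return 100.0 * (total - distance) / total


def max_distance_for_cutoff(score_cutoff, len1, len2):
    """Largest weighted distance that still yields ratio >= score_cutoff."""
    total = len1 + len2
    return math.floor(total * (100 - score_cutoff) / 100)
```

Under this assumption, two sequences of length 20 with score_cutoff=85 allow a weighted distance of at most 6, and score_cutoff=90 allows at most 4.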

max distance == 0

The similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).

max distance == 1 and len(s1) == len(s2)

The similarity can be calculated using a direct comparison as well, since a substitution (weight 2) would cause an edit distance higher than max distance. The time complexity of this algorithm is O(N).

Remove common affix

A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity. This step is performed for any of the following algorithms.

max distance <= 4

The mbleven algorithm is used. This algorithm checks all edit operations that are possible under the threshold max distance. A description of the original algorithm can be found here. I changed this algorithm to support the weight of 2 for substitutions. In contrast to the normal Levenshtein distance, this algorithm can even be used up to a threshold of 4 here, since the higher weight of substitutions decreases the number of possible edit operations. The time complexity of this algorithm is O(N).

len(shorter string) <= 64 after removing common affix

The BitPAl algorithm is used, which calculates the Levenshtein distance in parallel. The algorithm is described here and is extended with support for UTF32 in this implementation. The time complexity of this algorithm is O(N).
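
BitPAl itself is not reproduced here. As a rough illustration of how a bit-parallel algorithm handles a whole column of the DP matrix in a few word operations, the sketch below uses a different but related technique: a bit-vector LCS recurrence (commonly attributed to Hyyrö), from which the weighted distance follows as len(s1) + len(s2) - 2 * LCS. Python integers are unbounded, so the 64-character limit of a machine word does not show up in this sketch; the function names are hypothetical.

```python
def lcs_length_bitparallel(s1, s2):
    """Length of the longest common subsequence using one big bit vector."""
    mask = (1 << len(s1)) - 1
    # For every character, a bitmask of the positions where it occurs in s1.
    positions = {}
    for i, ch in enumerate(s1):
        positions[ch] = positions.get(ch, 0) | (1 << i)

    state = mask  # all ones; every zero bit left at the end is one LCS match
    for ch in s2:
        u = state & positions.get(ch, 0)
        state = ((state + u) | (state - u)) & mask
    return len(s1) - bin(state).count("1")


def indel_distance(s1, s2):
    """Edit distance with insertions and deletions only (substitution = 2)."""
    return len(s1) + len(s2) - 2 * lcs_length_bitparallel(s1, s2)
```

The loop runs once per character of s2 and does a constant number of integer operations per step, which is where the O(N) behaviour for short strings comes from.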

Strings with a length > 64

The Levenshtein distance is calculated using Wagner-Fischer with Ukkonen's optimization. The time complexity of this algorithm is O(N * M). This could be replaced with a blockwise implementation of BitPAl in the future.

Improvements to processors

FuzzyWuzzy provides multiple processors like process.extractOne that are used to calculate the similarity between a query and multiple choices. Implementing this in C++ as well allows two more important optimizations:

When a scorer is used that is implemented in C++ as well, we can directly call the C++ implementation of the scorer and do not have to go back and forth between Python and C++, which provides a massive speedup.
We can preprocess the query depending on the scorer that is used. As an example, when fuzz.ratio is used as scorer, it only has to store the query into the 64bit blocks used by BitPAl once, which saves around 50% of the runtime when calculating the Levenshtein distance.

So far only extractOne and extract_iter are implemented in C++, while extract, which you would use, is still implemented in Python and uses extract_iter. So it can already use the second optimization, but still has to switch a lot between Python and C++, which is not optimal (this will probably be added in v1.0.0 as well).
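
The second optimization (preprocessing the query once) can be illustrated with a small pure-Python sketch. It is not RapidFuzz code: the class `CachedIndel` and the helper `extract_within` are hypothetical names, and the scoring loop is a bit-vector LCS recurrence chosen for brevity rather than BitPAl.

```python
class CachedIndel:
    """Builds the per-character bitmasks of the query once, then reuses them."""

    def __init__(self, query):
        self.query_len = len(query)
        self.mask = (1 << len(query)) - 1
        self.positions = {}
        for i, ch in enumerate(query):
            self.positions[ch] = self.positions.get(ch, 0) | (1 << i)

    def distance(self, choice):
        # Only this loop runs per choice; the query is never re-scanned.
        state = self.mask
        for ch in choice:
            u = state & self.positions.get(ch, 0)
            state = ((state + u) | (state - u)) & self.mask
        lcs = self.query_len - bin(state).count("1")
        return self.query_len + len(choice) - 2 * lcs


def extract_within(query, choices, max_dist):
    """Return (choice, distance) for every choice within max_dist of query."""
    scorer = CachedIndel(query)
    hits = []
    for choice in choices:
        d = scorer.distance(choice)
        if d <= max_dist:
            hits.append((choice, d))
    return hits
```

For example, `extract_within("ACGT", ["ACGT", "ACCT", "TTTT", "ACG"], 2)` keeps the first, second and fourth choice and drops "TTTT".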

Benchmarks

I performed benchmarks for extractOne and the individual scorers during development that show the performance difference between RapidFuzz and FuzzyWuzzy. Keep in mind that the performance for your case (all strings of length 20) is probably not as good, since many of the strings in the dataset used are very short.

Data source for the benchmarks:

words.txt (dataset with 99171 words)

Hardware the graphed benchmarks were run on (specification):

CPU: single core of a i7-8550U
RAM: 8 GB
OS: Fedora 32

Benchmark Scorers

The code for this benchmark can be found here

Benchmark extractOne

For this benchmark the code of process.extractOne is slightly changed to remove the score_cutoff parameter. This is done because in extractOne the score_cutoff is increased whenever a better match is found (and it exits once it finds a perfect match). In the future it would make more sense to benchmark process.extract, which does not have this behavior (the benchmark is performed using process.extractOne, since process.extract is not fully implemented in C++ yet). The benchmark code can be found here

This shows that, when possible, the scorers should not be used directly but through the processors, which can perform a lot more optimizations.

Alternative

As an alternative you could use a C++ implementation. The library RapidFuzz is available for C++ here. The implementation in C++ is relatively simple as well:

// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
std::vector<double> results;
results.reserve(choices.size());

rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
for (const auto& choice : choices)
{
  results.push_back(scorer.ratio(choice));
}

or in parallel using OpenMP

// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
// pre-size the output so each iteration writes to its own slot;
// calling push_back from several threads would be a data race
std::vector<double> results(choices.size());

rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);

#pragma omp parallel for
for (std::size_t i = 0; i < choices.size(); ++i)
{
  results[i] = scorer.ratio(choices[i]);
}

On my machine (see Benchmarks above) this evaluates 43 million words/sec, and 123 million words/sec in the parallel version. This is around 1.5 times as fast as the Python implementation (due to conversions between Python and C++ types). However, the main advantage of the C++ version is that you are relatively free to combine algorithms whichever way you want, while in the Python version you're forced to use the process functions that are implemented in C++ to achieve good performance.

Worth to mention the `polyleven` [library](https://github.com/fujimotos/polyleven). — 0x90, Jan 29 '21 at 11:21
`polyleven` implements a normal Levenshtein distance and not this weighted version. RapidFuzz provides an implementation of the normal Levenshtein distance as well (`rapidfuzz.string_metric.levenshtein`). A performance comparison can be found here: https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein — maxbachmann, Mar 07 '21 at 17:23
do you mind to share a working example with the weighted version in python? — 0x90, Mar 07 '21 at 17:24
In the docs I linked in my previous comment you can find some examples at the bottom (`string_metric.levenshtein("lewenstein", "levenshtein", weights=(1,1,2))`). This weighted version is also known as InDel distance, since it will only use Insertions/Deletions — maxbachmann, Mar 07 '21 at 17:32
I think that the weighted version doesn't obey the triangle inequality — 0x90, Mar 07 '21 at 17:35
Gaps at start/end are not supported for now. The weighted version might not obey the triangle inequality depending on the weights that are used. The case Insertion/Deletion=1, Substitution=2 is equivalent to an edit distance that only allows Insertions and Deletions. I think this should still obey the triangle inequality. — maxbachmann, Mar 07 '21 at 17:51

Shamis · Answer 2 · 2021-01-18T12:18:33.963

Few thoughts

BK-trees
Kudos to Ben Hoyt and his link to the issue, which I will draw from. That said, the first observation from the mentioned issue is that the BK-tree isn't exactly logarithmic. From what you told us, your usual d is ~6, which is 3/10 of your string length. Unfortunately, that means that if we look at the tables from the issue, you will get a complexity somewhere between O(N^0.8) and O(N). In the optimistic case of the exponent being 0.8 (it will likely be slightly worse), you get an improvement factor of ~100 on your 10B entries, since (1e10)^0.2 = 100. So if you have a reasonably fast implementation of BK-trees, it can still be worth it to use them, or to use them as a basis for further optimization.
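
Why the cost depends so strongly on d can be seen in a minimal BK-tree sketch. This is not pybktree's code; the class below is a simplified stand-in with a hypothetical `evaluations` counter that records how many distance computations a query needed.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None       # a node is a tuple (item, {distance: child})
        self.evaluations = 0   # distance calls made by the last find()

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def find(self, query, max_dist):
        self.evaluations = 0
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            self.evaluations += 1
            if d <= max_dist:
                results.append((d, item))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

With 20-letter strings, all distances fall in the range 0 to 20, so a window of width 2 * 6 + 1 = 13 around d prunes very few subtrees, and `evaluations` approaches the number of stored items as max_dist grows.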

The downside is that even if you use 1000 trees in parallel, you will only get the improvement from the parallelization, as the performance of the trees depends on d rather than on the number of nodes within the tree. However, even if you run all 1000 trees at once on a massive machine, we are at the ~10M nodes/tree which you reported as slow. Still, computation-wise, this seems doable.
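
The estimates above can be checked with a few lines of arithmetic. The sketch assumes the optimistic cost model of N^0.8 distance computations per query and applies it to both a single tree and a forest of 1000 trees; the model and the helper name are assumptions for illustration, not measurements.

```python
def expected_comparisons(n, exponent=0.8):
    """Assumed cost model: distance computations per query grow like n**exponent."""
    return n ** exponent


corpus = 10**10
shards = 1000

single_tree = expected_comparisons(corpus)             # about 1e8
speedup_vs_scan = corpus / single_tree                 # about 100

per_shard = expected_comparisons(corpus // shards)     # about 4e5 for 1e7 nodes
total_sharded = shards * per_shard                     # about 4e8 over all shards
```

Under this model the forest does roughly four times more total work than one big tree (1000^0.2 is about 3.98), so any gain has to come from running the shards in parallel.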

A brute force approach
If you don't mind paying a little, I would look into something like Google Cloud BigQuery, if that doesn't clash with some kind of data confidentiality. They will brute force the solution for you, for a fee. The current rate is $5/TB of data scanned by a query. Your dataset is ~10B rows * 20 chars. Taking one byte per char, one query would scan 200 GB, so ~$1 per query if you went the lazy way.
However, since the charge is per byte of data in a column and not per complexity of the question, you could improve on this by storing your strings as bits. At 2 bits per letter, this would save you 75% of the expenses.
Improving further, you can write your query in such a way that it asks for a dozen strings at once. You might need to be a bit careful to use a batch of similar strings in the query, to avoid clogging the result with too many one-offs.
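
The 2-bit packing and the cost estimate can be sketched as follows. The letter-to-code mapping, the function names, and the use of decimal terabytes (1 TB = 1e12 bytes) are assumptions made for this example; the $5/TB rate is the figure quoted above.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}   # arbitrary 2-bit code per base
LETTERS = "ACGT"


def encode_dna(seq):
    """Pack a DNA string into an integer, 2 bits per base (20 bases = 40 bits)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def decode_dna(value, length):
    """Inverse of encode_dna; the length is needed to restore leading 'A's."""
    bases = []
    for _ in range(length):
        bases.append(LETTERS[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def query_cost_usd(rows, bytes_per_row, usd_per_tb=5.0):
    """Cost of one full scan at a flat per-terabyte rate."""
    return rows * bytes_per_row / 1e12 * usd_per_tb
```

With these assumptions, `query_cost_usd(10**10, 20)` gives about $1 for one byte per letter and `query_cost_usd(10**10, 5)` about $0.25 for the packed form.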

Brute forcing of the BK-trees
If you go with the route above, you pay depending on the volume, so the ~100-fold decrease in the computations needed becomes a ~100-fold decrease in price. That might be useful, especially if you have a lot of queries to run.
However, you would need to figure out a way to store this tree in several layers of databases and query them recursively, as the BigQuery pricing depends on the volume of the data in the queried table.
Building a smart batch engine for recursive processing of the queries to minimize the costs could be a fun optimization exercise.

A choice of language
One more thing. While I think that Python is a good language for fast prototyping, analysis and thinking about code in general, you are past that stage. You are currently looking for a way to do a specific, well-defined and well-thought-out operation as fast as possible. Python is not a great language for this, as this example shows. While I used all the tricks I could think of in Python, the Java and C solutions were still several times faster. (Not to mention the Rust one that beat us all, but he beat us by algorithm as well, so it's hard to compare.) So if you go from Python to a faster language, you might gain another factor of ten, or maybe even more, in performance. This could be another fun optimization exercise.
Note: I am being rather conservative with the estimate, as fuzzywuzzy already offers to use a C library in the background, so I'm not too sure how much of the work still depends on Python. My experience in similar cases is that the performance gain can be a factor of 100 from pure Python (or worse, pure R) to a compiled language.

Amazon Athena looks fairly similar and the price appears to be the same. Alternatively you could look into Amazon Redshift. However that one seems to be for a bit different usecase. I didn't run the pricing math. — Shamis, Jan 18 '21 at 14:34
it's not trivial how to implement it over redshift since each word map into 100,000 other words on average and this will explode, maybe some graph DB. — 0x90, Jan 18 '21 at 14:35
In that case Amazon Athena - with the price of $1.25/query if you use memory-saving representation of your data and just brute force the solution. Less if you go through the hoops of implementing several db layers to cut off some possibilities. — Shamis, Jan 18 '21 at 14:37
Additionally, if a word maps into ~100K words, you could use a combined query of, let's say, 10 words to get a ~1M candidate list, which can be processed locally at a reasonable speed. This would make it easier for your wallet and also for your machine. — Shamis, Jan 18 '21 at 14:39
I also thought to use elasticsearch due to its lucene engine, but out-of-the-box it supports edit distance of 2 at most — 0x90, Jan 18 '21 at 14:39
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227495/discussion-between-shamis-and-0x90). — Shamis, Jan 18 '21 at 14:40

score 1 · Answer 3 · answered May 15 '22 at 10:13

Quite late to the party, but here is a possible solution which I would implement if I were in your situation:

Save the dataset as a text file, and put that file on a very fast disk region (preferably on tmpfs).
Prepare a beefy computer with many physical CPU cores (such as Threadripper 3990X that has 64 cores).
Use this implementation and GNU parallel to grok the dataset.

Here is a bit of technical info behind this solution:

The optimized version of Myers' algorithm (linked above) can process about 14 million entries per sec on a single CPU core.
If you can fully utilize all 64 physical cores, you can achieve a throughput of 896 million entries per sec (= 14M * 64 cores).
At this speed, you can perform a single query on a dataset of 10 billion entries in about 12 seconds using a single machine.
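
The linked implementation is not reproduced here. As a rough idea of what a Myers-style bit-vector computation looks like, below is a plain-Python sketch of the recurrence in the form commonly given by Hyyrö, together with the throughput arithmetic from the list above. The function names are hypothetical, and pure Python will be far slower than the quoted 14 million entries per second.

```python
def myers_levenshtein(pattern, text):
    """Levenshtein distance via a bit-vector recurrence (Myers / Hyyrö form)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    peq = {}  # per-character bitmask of positions in the pattern
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv, score = mask, 0, m   # vertical +1 / -1 deltas and current distance
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


def scan_seconds(entries, per_core_rate=14e6, cores=64):
    """Time for one full scan, assuming perfect scaling across cores."""
    return entries / (per_core_rate * cores)
```

`scan_seconds(10**10)` evaluates to a little over 11 seconds, in line with the estimate of about 12 seconds per query.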

I posted a more detailed analysis in this article. As shown in the article, I could perform a query against a dataset of 100 million records in 1.04s on my cheap desktop machine.

By using a more performant CPU (or splitting the task between multiple computers), I believe you can achieve the desired result. Hope this helps.