I have a lot of data that needs to be fuzzily queried very quickly ("quickly" is relative, of course, but on the order of a few seconds for ~100 million keys, and ideally faster). It's in the form of key/value pairs, where each key is a unique string and each value is an array of strings. That's the structure of the raw data, but I'm free to organize it into whatever data structures make lookup faster.
A lookup of values for a key should return not only the values for that exact key, but also the values for every key whose Levenshtein distance from it is within a predetermined threshold (say, 5).
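To make the shape concrete, here is roughly what the data and the lookup I'm after look like (a Python sketch; the names `index`, `fuzzy_lookup`, and `MAX_DISTANCE` are just illustrative):

```python
from typing import Dict, List

# The raw data: each unique key maps to an array of value strings.
index: Dict[str, List[str]] = {
    "hello": ["greeting", "salutation"],
    "Hello": ["formal greeting"],
    "yello": ["typo", "a shade of yellow?"],
}

MAX_DISTANCE = 5  # predetermined Levenshtein threshold

def fuzzy_lookup(query: str) -> List[str]:
    """Return the values of every key within MAX_DISTANCE edits of `query`."""
    ...  # this is the part that needs to be fast for ~100 million keys
```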
For example, a lookup of `hello` should not only return all values indexed under the key `hello`, but also those for `Hello`, `hello!`, `yello`, `helo`, `hellooo`, etc.
The naive solution, of course, is to iterate over every key, compute its Levenshtein distance from the query, and include its values if the distance is within the threshold. This does not scale: a single lookup scans all n keys, and each Levenshtein computation (the standard dynamic-programming algorithm) costs O(k * m) in the lengths of the two strings, so a lookup is O(n * k * m), which for ~100 million keys is unacceptable.
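For reference, the naive version I'm describing is roughly this (a sketch; `levenshtein` here is the standard dynamic-programming edit distance, hence O(k * m) per call):

```python
from typing import Dict, List

def levenshtein(a: str, b: str) -> int:
    """Standard DP edit distance: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def naive_fuzzy_lookup(index: Dict[str, List[str]],
                       query: str,
                       threshold: int = 5) -> List[str]:
    """O(n) scan over all keys, with an O(k * m) distance check per key."""
    results: List[str] = []
    for key, values in index.items():
        if levenshtein(query, key) <= threshold:
            results.extend(values)
    return results
```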
How can I structure this data to optimize lookup time complexity? Space complexity and insert, delete, and edit runtimes are all irrelevant (though I'd prefer to keep inserts under a second each to avoid a bottleneck there as well).
Some information on the data:
- New unique keys arrive at a fairly constant rate of about 10/second
- New strings to append to existing value arrays also arrive at about 10/second (the write path is sketched after this list)
- Each key's value array typically holds 1-5 elements, but some outliers have hundreds
- Each string in the value arrays is typically 20-40 characters long
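For completeness, the write path I have in mind is trivial at these rates; something like this (illustrative names again):

```python
from typing import Dict, List

def add_key(index: Dict[str, List[str]], key: str) -> None:
    """Register a new unique key (arriving at roughly 10/second)."""
    index.setdefault(key, [])

def append_value(index: Dict[str, List[str]], key: str, value: str) -> None:
    """Append a value string (typically 20-40 chars, roughly 10/second)."""
    index.setdefault(key, []).append(value)
```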