3

I have a lot of data that needs fuzzily queried very quickly (which, of course, is relative, but within the magnitude of a few seconds for ~100 million keys, but ideally faster). It's in the form of key/value pairs, where keys are a unique string and values are an array of strings. This is the structure of the actual data, but I'm free to organize it into whatever data structures make lookup faster.

A lookup of values for a key should include not only the values for that exact key, but also the values for all keys within a levenshtein distance within a predetermined threshold (say, 5).

For example, a lookup of hello should not only return all values indexed under the key for hello, but also those for Hello, hello!, yello, helo, hellooo, etc.

The naive solution, of course, is iterating over each key, calculating its levenshtein distance, and including its values if it is within a certain threshold. However, this solution does not scale well with O(n) time complexity to iterate over each key, O(n-1) to iterate over each key to compare it to, and O(n) per levenshtein calculation, resulting in a lookup time of O(n * n * n-1) which, of course, is unacceptable.

How can I structure this data to optimize lookup time complexity? Space complexity, insert, delete, and edit runtimes are all irrelevant (though, I'd prefer to keep inserts under a second each to avoid bottlenecks there also).

Some information on the data:

  • Unique keys are ready to be added at a fairly constant rate of 10/second
  • Strings are ready to append to value arrays at a fairly constant rate of 10/second
  • The size of each key's value is typically 1-5 elements, but some outliers have hundreds
  • Each string in the value arrays are typically 20-40 characters in length
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
drusepth
  • 393
  • 3
  • 14

1 Answers1

1

I'm not sure about the data structure being used here, but the easiest option might be to use ElasticSearch's fuzzy query or its relatives. Well fundamentaly it is using an inverted index with good optimizations for executing fuzzy queries.

NikoNyrh
  • 3,578
  • 2
  • 18
  • 32