4

I need to find a set of substrings (each about 32 characters) in a very large string ( about 100k) as fast as possible. I need the search to be fuzzy.

What is the best algorithm? I tried scanning whole big string for small strings and checking Levenshtein Distance for each step, but it takes lots of time.

AVEbrahimi
  • 17,993
  • 23
  • 107
  • 210
  • @NeerajJain String.contains() is not a fuzzy method. It searches for exact matches. – AVEbrahimi Apr 16 '15 at 05:56
  • some discussion over at http://stackoverflow.com/questions/2891514/algorithms-for-fuzzy-matching-strings http://stackoverflow.com/questions/327513/fuzzy-string-search-in-java http://stackoverflow.com/questions/16351641/algorithms-for-fast-string-approximate-matching – Thilo Apr 16 '15 at 05:56
  • how fast do you need it to be ? Give us something to aim at. – bvdb Apr 16 '15 at 16:39
  • do you need to determine the exact positions where the substrings occur ? Or is it enough to just know that it's in there somewhere ? – bvdb Apr 16 '15 at 16:53
  • is there a big number of substrings to search for ? – bvdb Apr 16 '15 at 16:54
  • Is that "fuzzy" idea an idea of yours to boost the performance ? Or is it a strict requirement, meaning that the word "abcde" should also match "acbde" for example. These details matter a lot. – bvdb Apr 16 '15 at 16:57
  • give us some background about the application. It could be relevant. For instance, is the word of 100k always the same one, and are the substrings of 32 characters received from a webservice call ? ... Really to vague to give a good answer right now. – bvdb Apr 16 '15 at 17:04

2 Answers2

3

Take a look at BLAST algorithm (http://en.wikipedia.org/wiki/BLAST). It is used for sequence search (eg DNA search). The basic problem is very similar to yours.

Essentially what you do is index short strings and find areas where matches are abundant, and do more computationally expensive search in that region.

ElKamina
  • 7,747
  • 28
  • 43
1

If I understand what you want right (you want to find a subsequences of a large string that are equal to a given set of strings of length 32), and your alphabet has a reasonable size (letters, digits and punctuation, for instance), then you can do the following:

  1. Find the first occurrence of each letter.

  2. For each position in the string, find the next occurrence of every letter after this position (you can do it in O(l * n) where l is the length of the string and n is the size of your alphabet by scanning from the end for each letter)

  3. For each string in your set of strings, find the first occurrence of the first letter of that string, then from that position find the first occurrence of the second letter in your string etc.

This way you spend O(l * n) time to preprocess, but then for each small string in your set you only do O(m) work where m is the length of that string.

Ishamael
  • 12,583
  • 4
  • 34
  • 52