3

I need to find a substring SIMILAR to a given pattern in a huge string. The source huge string could be up to 100 Mb in length. The pattern is rather short (10-100 chars). The problem is that I need to find not only exact substrings but also similar substrings that differ from the pattern in several chars (the maximum allowed error count is provided as a parameter).

Is there any idea how to speed up the algorithm?

Vitas
  • 243
  • 4
  • 10
  • are you looking for an algorithm that is optimized for a single query? Or an [indexing strategy](http://en.wikipedia.org/wiki/Index_(search_engine)) that will create the data structure given the 100MB source text so that all queries of similar nature will be optimized? – rwong Jun 19 '11 at 11:29
  • I found [a similar question](https://stackoverflow.com/questions/17740833/checking-fuzzy-approximate-substring-existing-in-a-longer-string-in-python) with several solutions to this problem in Python. – Anderson Green Apr 15 '22 at 23:47
  • To find similar substrings in two different strings, I would use a [sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment) algorithm. – Anderson Green Jun 22 '23 at 17:28

3 Answers3

1

1)There are many Algorithms related to the string searching. One of them is the famous Knuth–Morris–Pratt Algorithm.

2) You may also want to check Regular Expressions ("Regex") in whatever the language you're using. They will sure help you finding substrings 'similar' to the original ones.

i.e. [Java]

String pat = "Home";
String source = "IgotanewHwme";

for(int i = 0; i < pat.length(); i++){
    //split around i .. not including char i itself .. instead, replace it with [a-zA-Z] and match using this new pattern.
    String new_pat = "("+pat.substring(0, i)+")"+ "[a-zA-Z]" + "("+pat.substring(i+1, pat.length())+")";
    System.out.println(new_pat);
    System.out.println(source.matches("[a-zA-Z]*"+new_pat+"[a-zA-Z]*"));
}

and I think it's easy to make it accept any number of error counts.

Brad Mace
  • 27,194
  • 17
  • 102
  • 148
  • The Knuth-Morris-Pratt algorithm is not an [approximate string search](https://en.wikipedia.org/wiki/Approximate_string_matching) algorithm, but the [bitap algorithm](https://en.wikipedia.org/wiki/Bitap_algorithm) is. – Anderson Green Apr 15 '22 at 22:51
0

Sounds like you want Fuzzy/Approximate String Matching. Have a look at the Wikipedia page and see if you can't find an algorithm that suits your needs.

Asgeir
  • 1,092
  • 1
  • 7
  • 11
0

You can have a look at the Levenshtein distance, the Needleman–Wunsch algorithm and the Damerau–Levenshtein distance

They give you metrics evaluating the amount of differences between two strings (ie the numbers of addition, deletion, substitution, etc.). They are often used to measure the variation between DNA.

You will easily find implementations in various languages.

ThibThib
  • 8,010
  • 3
  • 30
  • 37