Algorithm for 2 Pattern String Matching

Question

I need to code an algorithm for longest two pattern prefix/suffix match whose time complexity is O(n+m1+m2), where n is the length of the String and m1, m2 are the lengths of pattern1 and pattern2 respectively.

Example: if the String is "OBANTAO" and Pattern1 is "BANANA" and Patten2 is "SIESTA" then the answer is the substring "BANTA" of String that consists of the prefix BAN of BANANA and the suffix TA of SIESTA.

Results from Google are: "Rabin-karp string search algorithm", "Knuth-morris-pratt algorithm" and "Boyer-moore string search algorithm".

I'm able to understand all the above 3 algorithms, but the problem is that, they are all based on "single pattern prefix/suffix matching". I'm unable to extend them for two pattern prefix/suffix match.

A sample algorithm or a link towards searching it would be greatly helpful for me in developing the program.

Is it always a substring? or a subsequence is also fine? Is it always prefix and suffix? For example in the string `lananaest` - will ananaest be a good solution because anana is a substring of banana and est is a substring of siesta? — amit, Feb 17 '14 at 20:23
What information would this yeild for you in practical terms? — , Feb 17 '14 at 20:26
Is the pattern you want to match always `(prefix of Pattern 1) + (suffix of pattern 2)`? So what you really want is the longest substring that consists of `p1 + s2`? You really need to specify your problem a little better. — Jim Mischel, Feb 17 '14 at 20:38
@amit Sorry not like that, the matching string should always be a clipped prefix from banana and a clipped suffix from siesta... i,e. (B/BA/BAN/BANA/BANAN) from BANANA and (A/TA/STA/ESTA/IESTA) from SIESTA — , Feb 17 '14 at 20:38
Welcome to StackOverflow! Please update your question when answering a comment (as this means your question didn't provide enough info), so that things are clearer for everyone, and try to use code markup when necessary to make your text more readable. — Robin, Feb 17 '14 at 20:41
@sin, I'm processing biological (DNA/RNA/Protein) sequences... A sample pattern from lab results can be compared with the genome search in order to find the similarities between the species. Many algorithms are available to compare a single pattern in a Sequence. So I'm doing separate searches and then comparing the results via concatenation for analysis. But i couldn't find any relevant algorithm to implement a two pattern matching program using which I could make my analysis faster and better... — , Feb 17 '14 at 20:44
@DhiwaTdG Will be possible to occur multiple occurrence? Do we need care in that case ? for example . BTABASTA . here BTA and BASTA both as valid case — Mani, Feb 17 '14 at 22:32
@Mani You are right, but BASTA is taken into consideration since it is longer than BTA — , Feb 17 '14 at 23:26

David Eisenstat · Accepted Answer · 2014-02-18T00:28:34.290

3

Knuth--Morris--Pratt can be modified straightforwardly to yield, for each position in the haystack string, the length of the longest prefix of the needle string with a match ending at that position. Use KMP to compute this information for Pat1 in String and reverse(Pat2) in reverse(String), then iterate over each position in String looking for the maximum prefix/suffix length.

Example with String = "OBANTAO" and Pat1 = "BANANA" and Pat2 = "SIESTA":

"BANANA" = Pat1 in String
 O B A N T A O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 0 ("")
| | | | 3 ("BAN")
| | | 2 ("BA")
| | 1 ("B")
| 0 ("")
0 ("")

"ATSEIS" = reverse(Pat2) in reverse(String)
 O A T N A B O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 1 ("A")
| | | | 0 ("")
| | | 2 ("AT")
| | 1 ("A")
| 0 ("")
0 ("")

Reverse the second array and sum componentwise.

  0 0 1 2 3 0 0 0
+ 0 0 1 0 2 1 0 0
-----------------
  0 0 2 2 5 1 0 0
          ^
          |
         max (argument = "BAN" ++ reverse("AT"))

edited Feb 18 '14 at 00:28

answered Feb 17 '14 at 21:23

David Eisenstat

64,237
7
60
120

Sorry I couldn't get to your point. could you please give me a sample demonstration of it? – Feb 17 '14 at 23:36
I like your algorithm. just wanted to make sure if i have input as BANABTA ,then your algorithm choose BANA( which doesnt has prefix) is that acceptable ? – Mani Feb 18 '14 at 15:12
@Mani I have no idea, but it's easy enough not to consider the positions where at least one array has a zero. – David Eisenstat Feb 18 '14 at 15:13
@David Eisenstat.. Thankyou so much..!!! It was really useful and I made an implementation out of it. and btw, now I'm able to compare two patterns and if necessary ill add some more too..!!! Thanks to all who have replied.!!! – Feb 19 '14 at 21:30

Mani · Answer 2 · 2014-02-18T19:00:07.850

I tried to Implement @David Eisenstat solution in Java. It is in O(2n) time and O(2(n+1)) Auxiliary Spaces

String prefix = "BANANA";
    String suffix = "SIESTA";
    String input = "SABANANQS";

    // Prepare Inputs
    char[] prefixArray = prefix.toCharArray();
    char[] suffixArray = suffix.toCharArray();
    char[] inputArray = input.toCharArray();
    int inputLength = inputArray.length;
    int suffixLength = suffixArray.length;
    int prefixLength = prefixArray.length;
    // Auxiliary Spaces O(2(n+1))
    int[] prefixIndexs = new int[inputLength+1];
    int[] suffixIndexs = new int[inputLength+1];

    int m = 0;
    int n = 0;
    // O(1)
    for (int i = 0; i < inputLength; i++) {
        if (inputArray[i] == prefixArray[m]){
            m = m+1;
            prefixIndexs[i+1] = m;
            if (m == prefixLength) {
                m = 0;
            }
        }else{
            m = 0;
        }
        if (inputArray[inputLength-1-i] == suffixArray[suffixLength-1-n]){   // Reverse order or input and reverse oder of suffix
            n = n +1;
            suffixIndexs[i+1] = n;
            if (n == suffixLength) {
                n = 0;
            }
        }else{
            n = 0;
        }
    }

    int currmax =0;
    int mIndex = 0; // prefix Index from start
    int nIndex = 0; // suffix Index from End
    //O(1)  - Do Sum and find the max
    for (int i = 0; i < (inputLength+1); i++) {
        m = prefixIndexs[i];
        n = suffixIndexs[inputLength-i];
        if ( m != 0 && n != 0){  // if both prefix and suffix exists
            if (m+n > currmax){
                currmax = (m+n);
                mIndex = m;
                nIndex = n;
            }
        }
    }

    System.out.println("Input :"+input);
    System.out.println("prefix :"+prefix);
    System.out.println("suffix :"+suffix);
    System.out.println("max :"+currmax);
    System.out.println("mIndex :"+mIndex);
    System.out.println("nIndex :"+nIndex);
    System.out.println(prefix.substring(0,mIndex)+suffix.substring(suffix.length() - nIndex,suffix.length()));

Instead of reset m and n to 0, we can keep an another array for each to implement KMP algorithm , since the input prefix and sufix doesnt has repetitive char sequence i left it

Algorithm for 2 Pattern String Matching

2 Answers2