2

Possible Duplicate:
Algorithm to determine if array contains n…n+m?

Lets assume that M > N, and you have 2 arrays. One of length M called A and one of length N called B. Is there a faster way to find out if array B exists in array A?

For Example:

A = [ 1 2 3 4 5 6 ]

B1 = [ 2 3 4 ]

so array B1 exists in A while something like [ 1 3 2 ] does not.

This is essentially implementing something like isSubstring() in Java with a char array.

The only method that I can think of is in O(n^2) where you compare each element in A with the initial element in B and iterate through B looking for a match.

I guess this question is fairly common in interviews, so my question is asking if there is a faster approach.

Community
  • 1
  • 1
user1431282
  • 6,535
  • 13
  • 51
  • 68

1 Answers1

2

You want KMP

algorithm kmp_search: input: an array of characters, S (the text to be searched) an array of characters, W (the word sought) output: an integer (the zero-based position in S at which W is found)

define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)

while m+i is less than the length of S, do:
    if W[i] = S[m + i],
        if i equals the (length of W)-1,
            return m
        let i ← i + 1
    otherwise,
        let m ← m + i - T[i],
        if T[i] is greater than -1,
            let i ← T[i]
        else
            let i ← 0

(if we reach here, we have searched all of S unsuccessfully)
return the length of S

Complexity:

Assuming the prior existence of the table T, the search portion of the Knuth–Morris–Pratt algorithm has complexity O(k), where k is the length of S and the O is big-O notation. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the while loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of T. By definition it is constructed so that if a match which had begun at S[m] fails while comparing S[m + i] to W[i], then the next possible match must begin at S[m + (i - T[i])]. In particular the next possible match must occur at a higher index than m, so that T[i] < i. Using this fact, we will show that the loop can execute at most 2k times. For in each iteration, it executes one of the two branches in the loop. The first branch invariably increases i and does not change m, so that the index m + i of the currently scrutinized character of S is increased. The second branch adds i - T[i] to m, and as we have seen, this is always a positive number. Thus the location m of the beginning of the current potential match is increased. Now, the loop ends if m + i = k; therefore each branch of the loop can be reached at most k times, since they respectively increase either m + i or m, and m ≤ m + i: if m = k, then certainly m + i ≥ k, so that since it increases by unit increments at most, we must have had m + i = k at some point in the past, and therefore either way we would be done. Thus the loop executes at most 2k times, showing that the time complexity of the search algorithm is O(k).

additional information:

Efficiency of the KMP algorithm

Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k). These complexities are the same, no matter how many repetitive patterns are in W or S.

Woot4Moo
  • 23,987
  • 16
  • 94
  • 151