2

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.

In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.

I understood SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is length of the string) to O(P+log(N)) (Thanks to Chris Eelmaa's answer here and jogojapans answer here).

I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.

The algorithm (taken from the source):

Pattern matching algorithm

Explanation of the used variable names:

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

Now I want to try the algorithm using the following example (partly taken from here):

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana 

A = "banana" ; N = 6
W = "ban" ; P = 3

I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.

Here is how I would step through the algorithm:

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).

Community
  • 1
  • 1
Paddre
  • 798
  • 1
  • 9
  • 19
  • You have to calculate `lcp(v, w)` yourself, you can't deduce this from precomputed arrays - any lcp involving the input string will take time. That is where the P in `P + logN` comes from. As for the `step through algorithm` - that's interesting. I'll look it one day. – Erti-Chris Eelmaa Feb 11 '15 at 11:17
  • Your error is in that you assume w1 refers to first character in W. It doesn't. It's actually second character in W, which would make it so: `'a' <= 'a'` which evaluates to true. – Erti-Chris Eelmaa Feb 11 '15 at 11:26
  • But it is not `w1`, it is `wl` with `l=0` and therefore the first character in W (i.e. `wl = 'b'`) – Paddre Feb 12 '15 at 15:22
  • yes, you're right. As of this point, I have no explanation. I did take a look at PlogN implementation, and it made sense - the P+logN not so much. – Erti-Chris Eelmaa Feb 12 '15 at 15:43

1 Answers1

1

I believe there was an error.

First condition is fairly easy to understand. When LCP length == pattern length, it's done. When your pattern is even smaller than or equal to the smallest one, then only choice is the smallest one.

The second condition is wrong. We can prove it by contradiction. r < P || Wr <= a... means r >= P && Wr > a... If r >= P, then how can we have Lw = N(not found), since we already have r length common prefix?

LeeLee
  • 158
  • 1
  • 6
  • Thanks for the concise answer. – Paddre Apr 16 '18 at 14:14
  • @Paddre After 3 years, you still remember this question. WOW~ Did you successfully run this algorithm? I am having a serious problem. The LCP-LR algorithm from Erti-Chris Eelmaa is actually quite wrong. Firstly, range min from [a,b] = lcp from [a, b+1]. Chris' algorithm is doing range-min instead of lcp. Secondly, the left and the right sub range shall overlap 1unit on the mid point. – LeeLee Apr 17 '18 at 08:28