Implementation of string pattern matching using Suffix Array and LCP(-LR)

Question

During the last weeks I tried to figure out how to efficiently find a string pattern within another string.

I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993) state, that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is length of the string) by using binary search and suffix arrays along with LCP arrays.

I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).

But I am still looking for an way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.

References:

Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058

UPDATE 1

Just to emphasize on what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays which seems much more complicated than just implementing an LCP array

UPDATE 2

Added link to referenced paper

C/C++ would be fine. Or any language implementation that is portable to C/C++ with less effort than implementing it on my own — Paddre, Feb 03 '15 at 12:29
I've modified the question slightly, as requests for implementations and libraries are generally off-topic here (though given you've evidently done some good research, readers may be somewhat more indulgent). — halfer, Feb 07 '15 at 08:47
@Paddre a c implementation would be very different from a c++ one, so please choose a language to get better help. — Iharob Al Asimi, Feb 07 '15 at 12:34
Since the rest of my program which would use it is written in C++, a C++ way to implement it would do it. However, since I'm interested in high efficiency, a C implementation would be fine as well (even though I avoid mixing up C and C++ code, for the sake of efficiency, I would do it ;-) ) — Paddre, Feb 07 '15 at 14:13

Erti-Chris Eelmaa · Accepted Answer · 2015-02-08T13:27:26.680

6

The termin that can help you: enchanced suffix array, which is used to describe suffix array with various other arrays in order to replace suffix tree (lcp, child).

These can be some of the examples:

https://code.google.com/p/esaxx/ ESAXX

http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA

The esaxx one seems to be doing what you want, plus, it has example enumSubstring.cpp how to use it.

If you take a look at the referenced paper, it mentions an useful property (4.2). Since SO does not support math, there is no point to copy it here.

I've done quick implementation, it uses segment tree:

// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1

// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
// 

// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]

// rangeI = LCP_LR[i]
//   rangeILeft  = LCP_LR[2 * i]
//   rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
    if(low == high)
    {
        LCP_LR[index] = LCP[low];
        return;
    }

    int mid = (low + high) / 2;

    buildLCP_LR(2*index, low, mid);
    buildLCP_LR(2*index+1, mid + 1, high);

    LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}

edited Feb 08 '15 at 13:27

answered Feb 07 '15 at 18:17

Erti-Chris Eelmaa

25,338
6
61
78

It does not. "Enhanced suffix array" is a generic term. The OP (and the bounty I've posted) is looking for a very specific aspect of an enhanced suffix array, and this is the construction of the LCP-LR array as described in the paper cited in the OP. Specifically, this enables `O(M + logN)` pattern matching on a suffix array. – BurntSushi5 Feb 07 '15 at 18:39
@BurntSushi5: What I mean't is that the termin leads you to closer finding perhaps some kind of solution (it lead me to ESAXX which shows how to efficiently find substrings from string). – Erti-Chris Eelmaa Feb 07 '15 at 23:42
@BurntSushi5: If you're interested in this specific paper & it's concrete implementation, I suggest you to add the paper here. Also, I did take a look: if you've managed to create LCP by using that paper (through interval tree), you already have **computed** the values of LCP-LR (you'd have to copy them to array from interval tree). Where are you stuck exactly? – Erti-Chris Eelmaa Feb 07 '15 at 23:47
I feel like the OP is already incredibly specific. We'd like to do efficient pattern matching (in `O(m + logn)` time). The paper cited provides an algorithm for doing this, but it requires the LCP-LR array. So what we want is: given a suffix array (not enhanced, just the plain ol' suffix array) and the LCP array, how does one compute the LCP-LR array in `O(n)` time? – BurntSushi5 Feb 08 '15 at 03:17
The OP links to [this answer](http://stackoverflow.com/questions/11373453/how-does-lcp-help-in-finding-the-number-of-occurences-of-a-pattern/11374737#11374737), which says, "Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array." But I don't know what it is. :-) – BurntSushi5 Feb 08 '15 at 03:17
@BurntSushi5: Updated my post, see if that helps. Ps, "straight-forward" my ass ;) – Erti-Chris Eelmaa Feb 08 '15 at 10:00
1

Very nice! The bounty ends in a few hours and I won't have time to actually verify it, but it looks plausible enough to me, so I gave your answer the bounty. :-) And it certainly gives me some threads to pull on. When I actually implement it, I'll post any updates if necessary. Thank you! :-) – BurntSushi5 Feb 08 '15 at 14:34
As far as I can see, your implementation works fine and meets my requirements :-) Could you give some advice how to use it best for string pattern matching? I've read the algo in the paper but I have still problems applying the implementation to the pseudocode :-/ – Paddre Feb 09 '15 at 01:46
I posted a new question regarding this here: http://stackoverflow.com/questions/28444226/understanding-the-algorithm-for-pattern-matching-using-an-lcp-array – Paddre Feb 11 '15 at 00:11

Rontogiannis Aristofanis · Answer 2 · 2015-02-09T15:25:25.967

Here is a fairly simple implementation in C++, though the build() procedure builds the suffix array in O(N lg^2 N) time. The lcp_compute() procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)

#include <stdio.h>
#include <string.h>
#include <algorithm>

using namespace std;

const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];

bool compare(int i, int j) {
  if(pos[i] != pos[j]) return pos[i] < pos[j]; // compare by the first h chars
  i += h, j += h; // if prefvious comparing failed, use 2*h chars
  return (i < N && j < N) ? pos[i] < pos[j] : i > j; // return results
}

void build() {
  N = strlen(str);
  for(int i=0; i<N; ++i) sa[i] = i, pos[i] = str[i]; // initialize variables
  for(h=1;;h<<=1) {
    sort(sa, sa+N, compare); // sort suffixes
    for(int i=0; i<N-1; ++i) tmp[i+1] = tmp[i] + compare(sa[i], sa[i+1]); // bucket suffixes
    for(int i=0; i<N; ++i) pos[sa[i]] = tmp[i]; // update pos (reverse mapping of suffix array)
    if(tmp[N-1] == N-1) break; // check if done
  }
}

void lcp_compute() {
  for(int i=0, k=0; i<N; ++i)
    if(pos[i] != N-1) {
      for(int j=sa[pos[i]+1]; str[i+k] == str[j+k];) k++;
      lcp[pos[i]] = k;
      if(k) k--;
    }
}

int main() {
  scanf("%s", str);
  build();
  for(int i=0; i<N; ++i) printf("%d\n", sa[i]);
  return 0;
}

Note: If you want the complexity of the build() procedure to become O(N lg N), you can replace the STL sort with radix sort, but this is going to complicate the code.

Edit: Sorry, I misunderstood your question. Although i haven't implemented string matching with suffix array, I think I can describe you a simple non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text, and the pattern. Given these string you create a new one, lets call it concat, which is the concatenation of the two given strings (first the text, then the pattern). You run the suffix array construction algorithm on concat, and you build the normal lcp array. Then, you search for a suffix of length pattern.size() in the suffix array you just built. Lets call its position in the suffix array pos. You then need two pointers lo and hi. At start lo = hi = pos. You decrease lo while lcp(lo, pos) = pattern.size() and you increase hi while lcp(hi, pos) = pattern.size(). Then you search for a suffix of length at least 2*pattern.size() in the range [lo, hi]. If you find it, you found a match. Otherwise, no match exists.

Edit[2]: I will be back with an implementation as soon as I have one...

Edit[3]:

Here it is:

// It works assuming you have builded the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str

bool match(int tlen, int plen) {
  int total = tlen + plen;
  int pos = -1;
  for(int i=0; i<total; ++i)
    if(total-sa[i] == plen)
      { pos = i; break; }
  if(pos == -1) return false; 
  int lo, hi;
  lo = hi = pos;
  while(lo-1 >= 0 && lcp[lo-1] >= plen) lo--;
  while(hi+1 <  N && lcp[hi] >= plen) hi++;
  for(int i=lo; i<=hi; ++i)
    if(total-sa[i] >= 2*plen)
      return true;
  return false;
}

I already have an efficient construction implementation. However as stated in the question, I am particularly interested in an implementation that can be used for efficient pattern matching. As far as I know, this is not possible by only using the "plain" LCP and array because of the lack of being able to check every possible match. Correct me if I'm wrong here ;-) — Paddre, Feb 07 '15 at 14:06
@Paddre I added the code for the `match` procedure to my answer — Rontogiannis Aristofanis, Feb 08 '15 at 12:16
I could manage to test it. First: you should mention, that the terminating symbol (often referred to as `#` or `$`) must be added by the user and is not part of the SA construction mechanism. Once I found out it was clear from your code, but not what I expected having worked with lots of other SA/LCP implementations before. Second: The matching does not seem to work properly. I tried `"vwvwxy#` as `text` and `"vwx"` as `pattern` (i.e. `concat = "vwvwxy#vwx") and `match()` returned `false`. — Paddre, Feb 08 '15 at 19:39
@Paddre You shouldn't add a terminating symbol. In the case you mentioned `concat` should be `"vwvwxyvwx"`. Try it, and you will see that `match()` returns `true`. — Rontogiannis Aristofanis, Feb 09 '15 at 15:19
Nope. with `str="vwvwxyvwx"`, `match(6,1)` returns `false`. (SA: `{0,6,2,1,7,3,8,4,5}`) — Paddre, Feb 09 '15 at 22:25
@Paddre You should call `match(6, 3)`, not `match(6, 1)`. Also make sure that you build the lcp array the same way as I do. In my code, `lcp[i]` denotes the longest common prefix between `sa[i]` and `sa[i+1]`. If in your code `lcp[i]` denotes the longest common prefix between `sa[i]` and `sa[i-1]`, then the match procedure doesn't work properly... — Rontogiannis Aristofanis, Feb 10 '15 at 19:47
Sorry the `match(6,1)` was a typo. My mistake was that I forgot to invoke `lcp_compute()` your code works fine thanks :-) — Paddre, Feb 10 '15 at 22:33

score 0 · Answer 3 · edited May 23 '17 at 12:02

0

Here is a nice post including some code to help you better understand LCP array and comparison implementation.

I understand your desire is the code, rather than implementing your own. Although written in Java this is an implementation of Suffix Array with LCP by Sedgewick and Wayne from their Algorithms booksite. It should save you some time and should not be tremendously hard to port to C/C++.

LCP array construction in pseudo for those who might want more information about the algorithm.

edited May 23 '17 at 12:02

Community

1
1

answered Feb 07 '15 at 10:16

Kerem

2,867
24
35

1

Unfortunately none of them gives me information about constructing LCP arrays which are appropriate for efficient pattern matching – Paddre Feb 07 '15 at 14:09

score 0 · Answer 4 · answered Apr 17 '18 at 08:46

I think @Erti-Chris Eelmaa 's algorithm is wrong.

L ... 'M ... M ... M' ... R
       |-----|-----|

Left sub range and right sub range should all contains M. Therefore we cannot do normal segment tree partition for LCP-LR array. Code should look like

def lcp_from_i_j(i, j): # means [i, j] not [i, j)
    if (j-i<1) return lcp_2_elem(i, j)
    return lcp_merge(lcp_from_i_j(i, (i+j)/2), lcp_from_i_j((i+j)/2, j)

The left and the right sub ranges overlap. The segment tree supports range-min query. However, range min between [a,b] is not equal to lcp between [a,b]. LCP array is continuous, simple range-min would not work!

Implementation of string pattern matching using Suffix Array and LCP(-LR)

4 Answers4

Linked