Fast way to find substring in text using suffix array and lcp

Question

I'm trying to find words which contains substring (as input) in huge text. The text looks like this: *america*python*erica*escape*.. Example: Input: "rica" => Output: america,erica

I use suffix array.

My pseudocode (pythonlike) is:

firstChar=input[0] // the first character of input
suffixArray=getSuffixArray(text) // suffix array
result=[]

for every index of suffix array which points to firstChar:
    length=len(input)
    indexText=text[suffixArray[index]]
    indexes=[]

    if input in text[indexText: indexText+length]:
        word=find whole word containig this index between '*' 
        result.append(word)

This works, but it is too slow. LCP array should improve a runtime of algorhitm but I can't figure out how. Will you give me an advice?

Thanks in advance!

score 0 · Answer 1 · edited May 23 '17 at 10:25

0

A free Python code for suffix array is at Effcient way to find longest duplicate string. It works up to 100 million characters on a personal computer.

edited May 23 '17 at 10:25

Community

1
1

answered Jan 13 '15 at 21:34

hynekcer

14,942
6
61
99

Fast way to find substring in text using suffix array and lcp

1 Answers1