0

I'm trying to find words which contains substring (as input) in huge text. The text looks like this: *america*python*erica*escape*.. Example: Input: "rica" => Output: america,erica

I use suffix array.

My pseudocode (pythonlike) is:

firstChar=input[0] // the first character of input
suffixArray=getSuffixArray(text) // suffix array
result=[]

for every index of suffix array which points to firstChar:
    length=len(input)
    indexText=text[suffixArray[index]]
    indexes=[]

    if input in text[indexText: indexText+length]:
        word=find whole word containig this index between '*' 
        result.append(word)

This works, but it is too slow. LCP array should improve a runtime of algorhitm but I can't figure out how. Will you give me an advice?

Thanks in advance!

user3620512
  • 39
  • 1
  • 8

1 Answers1

0

A free Python code for suffix array is at Effcient way to find longest duplicate string. It works up to 100 million characters on a personal computer.

Community
  • 1
  • 1
hynekcer
  • 14,942
  • 6
  • 61
  • 99