Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string. It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T:=abc, then abc,bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because any character is the beginning of exactly one suffix.

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x

Implementation

The suffix array is usually not explicitly stored in memory. Instead it is represented as a list of integers, each representing the starting position of a suffix.

abcabx 012345

Example: Given T as defined above, and assume a numbering of its positions from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text-compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementation
154 questions
50
votes
1 answer

Suffix Array Algorithm

After quite a bit of reading, I have figured out what a suffix array and LCP array represents. Suffix array: Represents the _lexicographic rank of each suffix of an array. LCP array : Contains the maximum length prefix match between two consecutive…
Spandan
  • 2,128
  • 5
  • 25
  • 37
40
votes
3 answers

What's the current state-of-the-art suffix array construction algorithm?

I'm looking for a fast suffix-array construction algorithm. I'm more interested in ease of implementation and raw speed than asymptotic complexity (I know that a suffix array can be constructed by means of a suffix tree in O(n) time, but that takes…
Cameron
  • 96,106
  • 25
  • 196
  • 225
23
votes
2 answers

How does LCP help in finding the number of occurrences of a pattern?

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string. Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find…
Cratylus
  • 52,998
  • 69
  • 209
  • 339
17
votes
2 answers

Suffix Arrays vs Suffix Trees

I just wanna know, when a suffix tree is superior to an enhanced suffix array. After reading Replacing suffix trees with enhanced suffix arrays i don't see a reason to use suffix trees anymore. Some methods can get complicated, but you can do…
14
votes
8 answers

Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available. When overlapping is allowed, the answer is trivial (deepest parent node in suffix tree). For example for…
14
votes
7 answers

Finding the longest repeated substring

What would be the best approach (performance-wise) in solving this problem? I was recommended to use suffix trees. Is this the best approach?
kukit
  • 307
  • 1
  • 3
  • 8
12
votes
2 answers

How to sort array suffixes in block sorting

I'm reading the block sort algorithm from the Burrows and Wheeler paper. This a step of the algorithm: Suppose S= abracadabra Initialize an array W of N words W[0, ... , N - 1], such that W[i] contains the characters S'[i, ... , i + k - 1] arranged…
Guido Tarsia
  • 1,962
  • 4
  • 27
  • 45
11
votes
3 answers

Suffix array nlogn creation

I have been learning suffix arrays creation, & i understand that We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is…
user2565192
  • 694
  • 1
  • 8
  • 19
11
votes
1 answer

Errata in the original paper on suffix arrays?

I'm looking at the pseudo-code given in figure 3 of the original paper introducing suffix arrays "SUFFIX ARRAYS: A NEW METHOD FOR ON-LINE STRING SEARCHES". I can't figure out the logic for lines 4 and 5 (indexing from 0). The lines reads: else if…
nomad
  • 1,809
  • 2
  • 18
  • 33
11
votes
4 answers

Effcient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c When I implement it in Python using suffix-array: example = open("iliad10.txt").read() def comlen(p, q): i = 0 for x…
Hanfei Sun
  • 45,281
  • 39
  • 129
  • 237
11
votes
1 answer

Shortest uncommon substring: shortest substring of one string, that is not a substring of another string

We need to find shortest uncommon substring between two strings i.e. if we have two strings a and b so we need to find the length of shortest substring of a that is not a substring of b. How to solve this problem using suffix array ? To be solved…
Aman Singhal
  • 835
  • 1
  • 11
  • 24
10
votes
1 answer

Kasai Algorithm for Constructing LCP-Array Practical Example

I am attempting to complete the Algorithm's on Strings course on Coursera and am stuck on the method to construct an LCP array described in this video: https://www.coursera.org/learn/algorithms-on-strings/lecture/HyUlH/computing-the-lcp-array I am…
10
votes
3 answers

Constructing a Good Suffix Table - Understanding an example

I'm really trying to understand an example on how to construct a good suffix table for a given pattern. The problem is, I'm unable to wrap my head around it. I've looked at numerous examples, but do not know where the numbers come from. So here…
yulai
  • 741
  • 3
  • 20
  • 36
10
votes
4 answers

strcmp for python or how to sort substrings efficiently (without copy) when building a suffix array

Here's a very simple way to build an suffix array from a string in python: def sort_offsets(a, b): return cmp(content[a:], content[b:]) content = "foobar baz foo" suffix_array.sort(cmp=sort_offsets) print suffix_array [6, 10, 4, 8, 3, 7, 11, 0,…
ephes
  • 1,451
  • 1
  • 13
  • 19
8
votes
9 answers

Efficient String/Pattern Matching in C++ (suffixarray, trie, suffixtree?)

I'm looking for an efficient data structure to do String/Pattern Matching on an really huge set of strings. I've found out about tries, suffix-trees and suffix-arrays. But I couldn't find an ready-to-use implementation in C/C++ so far (and…
Constantin
  • 8,721
  • 13
  • 75
  • 126
1
2 3
10 11