8

How to find the longest repeated (non-overlapping) subsequence (not substring)?

constraint:

The string S consist of at most 100.000 lower case character 'a'-'z'.

example:

String hanadswomehanudsiome has longest repeated (non-overlapping) subsequence handsome.

The expected time complexity is O(|S| log |S|) or better (|S| is length of string S).

  • are you having trouble putting this into code? – z atef Feb 22 '16 at 03:33
  • no, I still don't get the algorithm. – Dewangga Winardi Feb 22 '16 at 04:21
  • 2
    Why do you think there's an O(N log N) solution? It seems unlikely given how similar this is to the longest-common-subsequence problem. – Paul Hankin Feb 22 '16 at 06:47
  • Can you provide a link to the original problem (if possible), or else, please add more details to your question, for example, maximum number of character in the string, does the string only contains letter...? – Pham Trung Feb 22 '16 at 06:49
  • Edited :) Sorry for the unclear statement. – Dewangga Winardi Feb 22 '16 at 07:07
  • Dewangga, what is your N? Is it the length of the string? With your formulation the problem can be solved in O(1) because you have put an upper bound on the problem size. – Emil Vikström Feb 22 '16 at 07:49
  • 2
    There appear to be two plausible interpretations of 'non-overlapping' in the problem statement: The first (and much more likely) is that the first occurrence must end before the second begins. The other is that non-overlapping means only disjoint, but allows for possibly interwoven occurrences. These meanings coincide in the case of substrings. Is there any chance the second interpretation was meant? – kcsquared Oct 14 '21 at 14:16
  • 1
    The paper, "An Efficient Algorithm for the Longest Tandem Scattered Subsequence Problem" by Adrian Kosowski, seems to address this problem. I believe they conjecture O(n^2) is not possible to improve on. – גלעד ברקן Oct 15 '21 at 15:07
  • @גלעדברקן That's an excellent reference paper. There's a trivial reduction from LCS to this problem that implies `O(n^2)` is the fastest possible, based on standard hardness assumptions on LCS (although this is a decades-old, heavily researched open problem in CS). The `O(n^2)` algorithm presented in that paper, however, seems highly nontrivial to come up with. – kcsquared Oct 15 '21 at 15:26
  • Have you seen this: https://cs.stackexchange.com/questions/27776/longest-repeated-scattered-subsequence-in-a-string – gst Oct 18 '21 at 12:05

1 Answers1

2

The problem, defined in the language of computer science, is finding the longest common scattered sub-sequence (LCSS) that occurs at least twice without overlap in the input string.

This is a matter of ongoing research in computer science, and it is a sub-problem of the more general longest common subsequence problem. In one form or another, solutions to this problem have been proposed going from 1977 (Hunt-Szymanski algorithm) to the present day.

One of the more recent publications on the problem is "Small Longest Tandem Scattered Subsequences" by Russo and Francisco. It claims this time complexity for the algorithm:

O(min{n, ℓ}λ(1 + log(min{λ, ℓ/λ})) + nλ + ℓ)

Where:

  • n is the size of the input string
  • λ is the size of the longest common scattered sub-sequence
  • is the number of pairs of positions in the input string that contain the same letter

However, for a constrained input size, the time complexity collapses to O(1) as was pointed out in the comments.

mnistic
  • 10,866
  • 2
  • 19
  • 33