1

I'm trying to implement the algorithm given by Evgeny Kluev in his answer to loop rolling algorithm but I'm having trouble getting it to work. Below is an example I tried to work out by hand following his instructions:

text: ababacababd


<STEP 1>                                    
  suffixes and LCPs:                               

  ababacababd                                            
  4 
  ababd 
  3 
  abacababd 
  2  
  abd
  1
  acababd
  0
  babacababd
  3
  babd
  2
  bacababd
  1
  bd
  0
  cababd
  0
  d

<STEP 2>
  sorted LCP array indices: 0,1,5,2,6,3,7,4,8,9
                 (values) : 4,3,3,2,2,1,1,1,0,0

<STEP 3>
LCP groups sorted by position in text (format => position: entry):
  lcp 4:
    0: ababacababd
    6: ababd

  lcp 3:
    1: babacababd
    2: abacababd
    6: ababd
    7: babd

  lcp 2:
    2: abacababd
    3: bacababd
    7: babd
    8: abd

  lcp 1:
    3: bacababd
    4: acababd
    8: abd
    9: bd

  lcp 0:
    0: ababacababd
    1: babacababd
    4: acababd
    5: cababd
    9: bd
   10: d

<STEP 4>
entries remaining after filter (LCP == positional difference):
   none! only (abd),(bd) and (bacababd),(acababd) from LCP 1 group
   have positional difference equal to 1 but they don't share prefixes
   with each other. shouldn't i have at least (ab) and (ba) here?

Can anybody tell me what I'm doing wrong in this process?

Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?

Is this a known algorithm with a name that I can find more details on elsewhere?

(I'm also confused about his definition of intersecting sequences in step 5, but maybe it would make sense if I understood the preceding steps correctly).

EDIT: Here is what I have for step 4,5,6 after Evgeny's helpful clarification:

<STEP 4>
filter pseudocode:
results = {}
for (positions,lcp) in lcp_groups:
  results[lcp] = []
  while positions not empty:
    pos = positions.pop(0) #pop lowest element
    if (pos+lcp) in positions:
      common = prefix(input, pos, lcp)
      if common.size() < lcp:
        next
      i = 1
      while (pos+lcp*(i+1)) in positions:
        if common != prefix(input, pos+lcp*i, lcp):
          break
        positions.delete(pos+lcp*i)
        i += 1

      results[lcp].append( (common, pos, i+1) )

application of filter logic:
  lcp 4:
    0: ababacababd # 4 not in {6}
    6: ababd       # 10 not in {}

  lcp 3:
    0: ababacababd # 3 not in {1,2,6,7}
    1: babacababd  # 4 not in {2,6,7}
    2: abacababd   # 5 not in {6,7}
    6: ababd       # 9 not in {7}
    7: babd        # 10 not in {}

  lcp 2:
    0: ababacababd # 2 in {1,2,3,6,7,8}, 4 not in {1,2,3,6,7,8} => ("ab", 0, 2)
    1: babacababd  # 3 in {2,3,6,7,8}, 5 not in {2,3,6,7,8} => ("ba", 1, 2)
    2: abacababd   # 4 not in {3,6,7,8}
    3: bacababd    # 5 not in {6,7,8}
    6: ababd       # 8 in {7,8}, 10 not in {7,8} => ("ab", 6, 2)
    7: babd        # 9 not in {8}
    8: abd         # 10 not in {}

  lcp 1:
    0: ababacababd # 1 in {1,2,3,4,6,7,8,9}, prefix is ""
    1: babacababd  # 2 in {2,3,4,6,7,8,9}, prefix is ""
    2: abacababd   # 3 in {3,4,6,7,8,9}, prefix is ""
    3: bacababd    # 4 in {4,6,7,8,9}, prefix is ""
    4: acababd     # 5 not in {6,7,8,9}
    6: ababd       # 7 in {7,8,9}, prefix is ""
    7: babd        # 8 in {8,9}, prefix is ""
    8: abd         # 9 in {9}, prefix is ""
    9: bd          # 10 not in {}

sequences: [("ab", 0, 2), ("ba", 1, 2), ("ab", 6, 2)]

<STEP 5>
add sequences in order of LCP grouping. sequences within an LCP group
are added according to which was generated first:
  lcp 4: no sequences
  lcp 3: no sequences
  lcp 2: add ("ab", 0, 2)
  lcp 2: dont add ("ba", 1, 2) because it intersects with ("ab", 0, 2)
  lcp 2: add ("ab", 6, 2)
  lcp 1: no sequences

collection = [("ab", 0, 2), ("ab", 6, 2)]
(order by position not by which one was added first)

<STEP 6>
recreate input by iterating through the collection in order and 
filling in gaps with the normal input:
  input = "ab"*2 + input[4..5] + "ab"*2 + input[10..10]

Evgeny, one quick question for you if you happen to look at this again: Am I doing step 5 correctly? That is, do I add sequences according to which LCP group they were generated from (with higher LCP valued groups coming first)? Or is it something else related to LCP?

Also, if there is anything wrong with step 4 or 6 please let me know, but it seems what I have works very well for this example.

Community
  • 1
  • 1
yorble
  • 81
  • 4

1 Answers1

1

I have to clarify what is meant by "grouping by LCP value" in the original answer. In fact, to the group with selected LCP value we should include all entries with larger LCP values.

This means that for your example, while processing LCP3, we need to merge preceding entries 0 and 6 to this group. And while processing LCP2, we need to merge all preceding entries with LCP3 and LCP4: 0, 1, 2, 6, 7.

As a result two (ab) pairs as well as one (ba) pair are remaining after filter. But since (ba) "intersects" with the first (ab) pair, it is rejected on step 5.

Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?

That's right, I mean all possible repeating sequences.

Is this a known algorithm with a name that I can find more details on elsewhere?

I don't know. Never seen such algorithm before.


Here is how steps 2 .. 4 may be implemented (in pseudo-code):

for (in_sa, in_src) in suffix_array: # step 2
  lcp_groups[max(LCP[in_sa.left], LCP[in_sa.right])].append((in_sa, in_src))
apply(sort_by_position_in_src, lcp_groups) # step 3
for lcp from largest downto 1: # step 4
  # sorted by position in src array and unique:
  positions = merge_and_uniq(positions, lcp_groups[lcp])
  for start in positions:
    pos = start
    while (next = positions[pos.in_src + lcp]).exists
           and LCP.RMQ(pos.in_sa, next.in_sa) >= lcp
           and not(prev = positions[pos.in_src - lcp]).exists  # to process each
                   and LCP.RMQ(pos.in_sa, prev.in_sa) >= lcp): # chain only once
      pos = next
    if pos != start:
      pass_to_step5(start, lcp, pos + lcp)

Here I don't plan any particular data structure for positions. But for convenience an ordered associative array is assumed. RMQ is Range Minimum Query, so LCP array should be preprocessed accordingly.

This code is practically the same as the code in OP. But instead of expensive string comparisons (like common != prefix(input, pos+lcp*i, lcp)) it uses RMQ, which (if properly implemented) works almost instantly (and has the same effect as the string comparison because it allows to reject a sub-string when it has too few starting characters in common with preceding sub-string).

It has quadratic worst-case time complexity. Should be slow for input arrays like "aaaaaaaaa". And it's not easy to find its time complexity for "better" strings, probably it is sub-quadratic in "average" case. The same problem could be solved with much simpler quadratic-time algorithm:

def loop_rolling(begin, end):
  distance = (end - begin) / 2)
  for d from distance downto 1:
    start = pos = begin
    while pos + d < end:
      while (pos + d < end) and (src[pos] == src[pos + d]):
        ++pos
      repeats = floor((pos - start) / d)
      if repeats > 0:
        pass_to_step5(start, d, start + d * (repeats + 1))
      start = pos

Or it may be made even simpler by removing steps 5 and 6. But the variant below has a disadvantage. It is much too greedy, so instead of 5*(ab) it will find 2*(2*(ab))ab:

def loop_rolling(begin, end, distance):
  distance = min(distance, (end - begin) / 2))
  for d from distance downto 1:
    start = pos = begin
    while pos + d < end:
      while (pos + d < end) and (src[pos] == src[pos + d]):
        ++pos
      repeats = floor((pos - start) / d)
      if repeats > 0:
        loop_rolling(begin, start, d - 1)
        print repeats+1, "*("
        loop_rolling(start, start + d, d - 1) # "nested loops"
        print ')'
        loop_rolling(start + d * (repeats + 1), end, d)
        return
      else:
        if d == 1: print src[start .. pos]
        start = pos
Evgeny Kluev
  • 24,287
  • 7
  • 55
  • 98
  • thanks so much for this algorithm and for helping explain it. please see my update if you have time to answer my follow up questions. – yorble Oct 15 '13 at 06:58
  • @yorble: I think, you are doing everything correctly, including step 5 where higher LCP valued groups should indeed come first. As for pseudo-code, while it is also correct, I think it might be optimized a little bit: conditions `prefix.size() < lcp` and `common != prefix(input, pos+lcp*i, lcp)` cannot be ever satisfied, so instead of part of the algorithm they may be turned into assertions; also it's not really necessary to extract actual characters from the string at least until step 6 (it's enough to use only indexes on steps 4 and 5). – Evgeny Kluev Oct 15 '13 at 09:12
  • what about when we are going through LCP group for LCP=1 with the filter process: (0,ababacababd) and (1,babacababd) are the first two entries in that group, and they have positional difference equal to LCP (1), so if we dont check whether they actually have a shared prefix, wouldn't we mistakenly think there is an "a" or "b" sequence when there really isn't? In other words, this is an example where positional difference being equal to LCP doesn't actually yield a repeating sequence. This is the reason why I added `prefix.size() < lcp` check. – yorble Oct 15 '13 at 10:20
  • @yorble: probably I misunderstood your pseudo-code and you actually use some brute-force approach to get first pair of matching substrings... In the algorithm that I propose, while processing the group LCP1 you'll never have to process (0,ababacababd) and (1,babacababd) pair. My algorithm assumes that you compute difference between adjacent positions in SA (before it is sorted by position) and compare this difference to current LCP group's value... – Evgeny Kluev Oct 15 '13 at 11:00
  • ... Which gives fist repeating pair. And then you could use your nested `while` loop to append additional repetitions. – Evgeny Kluev Oct 15 '13 at 11:02
  • @yorble: linked answer gives only the general idea of the algorithm. So I written some pseudo-code to make it clearer. Also quite a long time passed since that answer was posted; for now I would prefer a simpler approach to the same problem (see second pseudo-code). – Evgeny Kluev Oct 15 '13 at 15:39