11

From Section 15.2 of Programming Pearls

The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

When I implemented it in Python using a suffix array:

example = open("iliad10.txt").read()
def comlen(p, q):  # length of the common prefix of p and q
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):  # adjacent sorted suffixes share the longest common prefixes
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i

I found the idx.sort step to be very slow. I think it's slow because Python needs to pass the substrings by value instead of by pointer (as the C code above does).

The test file can be downloaded from here

The C code needs only 0.3 seconds to finish:

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s

But the Python code never ends on my computer (I waited 10 minutes and then killed it).

Does anyone have ideas on how to make this code efficient (for example, finishing in less than 10 seconds)?

Hanfei Sun
  • How long does the C code take? How long does your code take? – beatgammit Nov 26 '12 at 07:00
  • @tjameson The C code takes 0.3 seconds. I don't know how long my code takes as it never ends (at least 10 minutes).. – Hanfei Sun Nov 26 '12 at 07:02
  • The C code is slow because it fails to keep track of the "longest match so far" when sorting, and has to check everything a second time. The Python is slow for the same reason, plus because it's operating on strings and not pointers to strings, plus because it's Python. – Brendan Nov 26 '12 at 07:21
  • `example[a:]` copies a string each time (`O(N)`). So your sort is `O(N*N*logN)`. For the Iliad that is `~10**12` operations, which is slow. – jfs Nov 26 '12 at 07:28
  • Since Programming Swines, err, sorry Pearls, relies heavily on various forms of undefined, unspecified and implementation-defined behavior, you cannot easily translate code from it to another language which doesn't have the same kind of non-specified behavior. – Lundin Nov 26 '12 at 08:00
  • @HanfeiSun, I don't think the best answer is the accepted one. For the sake of future visitors like me, would you please take a look? – Shihab Shahriar Khan Dec 08 '17 at 14:33

4 Answers

16

My solution is based on suffix arrays, constructed by prefix doubling, with the longest common prefix (LCP) array computed alongside. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicated strings are longer than 10000 characters.

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for other criteria, e.g. to search
    for the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if substring not in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of length _step are first pre-sorted. These results are
    then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Only for optimization and testing. It is the length of
               substrings used for the initial pre-sorting. A bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp

I prefer this solution over more complicated O(n log n) algorithms because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the necessary linear-time operations in the method from that article, which should be O(n) only under very special presumptions of random strings with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache efficiently.

The code in another answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent, e.g. in "independent" school homework collections. The program should not freeze on them.

I wrote an example and tests with the same code for Python 2.7 and 3.3 - 3.6.
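For illustration, a minimal usage sketch of the functions above (a hypothetical driver, assuming "iliad.mb.txt" is in the working directory; the print-as-function syntax is for Python 3, and on Python 2.7 it prints a tuple):

text = open('iliad.mb.txt').read()
# longest_common_substring() maps each longest repeated substring to the
# sorted list of positions where it occurs.
for substring, positions in longest_common_substring(text).items():
    print(len(substring), positions)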

hynekcer
5

The translation of the algorithm into Python:

from itertools import imap, izip, starmap, tee
from os.path   import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)

buffer() allows getting a substring without copying:

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]

It takes 5 seconds on my machine for the iliad.mb.txt.

In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with an LCP array.
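As a rough sketch of that extraction step (hypothetical glue code, not from the original answer; it assumes a suffix_array(text) helper returning (sa, rsa, lcp) like the one in the answer above, and a text of at least two characters):

def longest_duplicate_lcp(data):
    # lcp[i] is the length of the common prefix of the suffixes starting at
    # sa[i-1] and sa[i], so the longest duplicate sits at the maximum lcp value.
    sa, _rsa, lcp = suffix_array(data)
    i = max(range(1, len(data)), key=lambda k: lcp[k])
    return data[sa[i]:sa[i] + lcp[i]]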


Note: the *_memoryview() version below is superseded by the *_buffer() version above.

A more memory-efficient version (compared to longest_duplicate_small()):

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()

It takes 17 seconds on my machine for the iliad.mb.txt. The result is:

On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away. 

I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces the wrong result in Python 2:

>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True

Related questions:

Find the longest repeating string and the number of times it repeats in a given string

finding long repeated substrings in a massive string

jfs
  • Since your code requires Python 3+ and I don't have access to that version at the moment, could you please provide the running time for my version of the code in your environment as well? – lenik Nov 27 '12 at 08:11
  • @lenik: The code works on Python 2.7. What makes you think that it is for Python 3? – jfs Nov 27 '12 at 13:12
  • Could you please stop arguing about unrelated things and just provide the running time? – lenik Nov 27 '12 at 13:20
  • @lenik: in case you can't run both Python 2.7 and 3, here's the running time: 12 seconds. – jfs Nov 27 '12 at 18:34
  • Side-note: The reason it produces an incorrect result on Python 2 (and an exception on Py3) is that `memoryview` only defines the equivalent of `__eq__` and `__ne__`, not the rest of the rich comparison operators; on Py2, this means it goes to the comparison of last resort (which ends up comparing the objects' memory addresses, totally useless), while Python 3 informs you that the comparison isn't supported. There is [a bug open to fix this](https://bugs.python.org/issue20399), but it's seen no action in the last five years. – ShadowRanger Jan 18 '19 at 16:01
4

The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551

You'll have to use a memoryview instead to get a reference rather than a copy. When I did this, the program hung after the idx.sort call (which was very fast).

I'm sure with a little work, you can get the rest working.
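A small sketch of the difference (the sizes are only illustrative):

data = b"x" * 10 ** 6         # ~1 MB of data
copy = data[1:]               # slicing a bytes/str object copies ~1 MB
view = memoryview(data)[1:]   # slicing a memoryview creates a view, no copy
assert view.tobytes() == copy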

Edit:

The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}

And compare the result to this python:

test1 = "ovided by The Internet Classics Archive"
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))

The C code prints -3 on my machine while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, and 5 were the first three values).

To make sure that -3 wasn't some error code: if we reverse test1 and test2, we get 3.

Edit:

The above is interesting trivia, but not actually correct in terms of affecting either chunk of code. I realized this just as I shut my laptop and left a wifi zone... I really should double-check everything before I hit Save.
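To spell this out with a hedged Python 2 sketch (a hypothetical comparator, not from the original code): a cmp function passed to sorted(), like strcmp passed to qsort, is inspected only for the sign of its return value, so magnitudes such as -3 or 5 are harmless:

def strcmp_like(a, b):
    # Return values of arbitrary magnitude; only the sign matters to the sort.
    for x, y in zip(a, b):
        if x != y:
            return ord(x) - ord(y)
    return len(a) - len(b)

print(sorted(["banana", "ana", "anana"], cmp=strcmp_like))
# -> ['ana', 'anana', 'banana']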

FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):

print(cmp(memoryview(test1), memoryview(test2)))

I'm not sure why the code isn't working as expected; the printed list does not look right on my machine. I'll look into this and try to find a better solution instead of grasping at straws.

beatgammit
  • Thanks, tjameson! But even using `memoryview`, you still need to pass the string to `cmp`, right? Then it still needs to be passed by value? – Hanfei Sun Nov 26 '12 at 10:46
  • This one doesn't work, as `cmp` can't be used on `memoryview` objects – Hanfei Sun Nov 26 '12 at 11:27
  • 3
    Bentley's code does *not* abuse `strcmp`. It just uses it to compare strings in `qsort`, which in turn never relies on anything but the *sign* of the return value. – Fred Foo Nov 26 '12 at 19:51
  • @larsmans - As mentioned in my comment, I realized this about 5 minutes after posting. Right about the time I stopped staring at the code... Revising answer. – beatgammit Nov 26 '12 at 20:07
  • @Firegun - `cmp` most definitely **does** work with `memoryview` objects. I'm guessing that there's something else at play here. – beatgammit Nov 26 '12 at 20:12
  • 2
    memoryview comparison doesn't work. See example in [my answer](http://stackoverflow.com/a/13574862/4279) – jfs Nov 26 '12 at 23:19
  • @tjameson: memoryview comparison works in PyPy 1.9, but it doesn't in CPython 2.7.3. – jfs Nov 26 '12 at 23:48
  • I've added a [`buffer()`](http://docs.python.org/2/library/functions.html#buffer)-based version [that is 3-4 times faster than my previous `memoryview()`-based version](http://stackoverflow.com/a/13574862/4279). – jfs Nov 28 '12 at 00:01
0

This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:

#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()

    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break

            # add one more character
            s = string + ex[p + offset]

            if s not in new_chains :
                new_chains[s] = list()

            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)

    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]

    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]

    chains = grow_chains(chains)

The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...
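As the comments below point out, the loop above discards the final result once chains becomes empty. A minimal variant that remembers the last non-empty generation (a sketch reusing grow_chains() from above, not part of the original answer):

longest = []
while chains :
    # keep only the chains that occur more than once
    filtered = [(s, chains[s]) for s in chains if len(chains[s]) > 1]
    if filtered :
        longest = filtered    # remember the last non-empty generation
    chains = grow_chains(filtered)

print [s for (s, pos) in longest]  # the longest repeated substrings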

lenik
  • If more than one repeated substring has the same maximal length, nothing is returned. Example: `ex = 'ABCxABCyDEFzDEF'` – hynekcer Nov 26 '12 at 12:57
  • @hynekcer the last set is always empty (that's the loop stopping condition), but the one before that contains: `['ABC', 'DEF']`. I don't see why this is wrong? There are obvious limitations in my code: only the first 3 chains are printed; if there are more, you have to modify the code or something, pretty printing was never my goal. – lenik Nov 26 '12 at 13:20
  • I expect that the result will finally be in the chains variable, but it is lost. Debug printing is not important for an algorithm. – hynekcer Nov 26 '12 at 13:28
  • @hynekcer debug printing helps to understand how it works. If you need only the answer, save the result of the filtering in a temporary variable, and when it's empty, print whatever you have in `chains`; that should work just fine for any number of substrings of any length. – lenik Nov 26 '12 at 13:31
  • The biggest problem is that your algorithm can require more than `N * N / 4` bytes of memory, where N is the length of the input string. Example: `ex = ' '.join('%03s' % i for i in range(500))` I print `sum(len(string) for string in chains)` and I see that the biggest value is 1001000. The required time is proportional to `N * N * N`. – hynekcer Nov 26 '12 at 13:47
  • Whatever you say, it **works**. If you have a bigger/better/stronger solution, I'd be glad to see it and compare. – lenik Nov 26 '12 at 13:54
  • If we allow `O(N*N)` memory then you could use simpler code. For example, see [`longest_duplicate_small()` in my answer](http://stackoverflow.com/a/13574862/4279). – jfs Nov 26 '12 at 23:22
  • @J.F.Sebastian `suffixes = sorted(data[i:] for i in xrange(len(data)))` just overflows the memory on my computer; it definitely is not a working solution. Have you ever had it running, and if yes, how much memory did it really take (in gigabytes)? – lenik Nov 27 '12 at 05:05
  • @lenik: `_small` in the name stands for small input data. That is why there is the second example that requires less memory. – jfs Nov 27 '12 at 05:10
  • @J.F.Sebastian you should not so easily believe things other people say. My algorithm **does not** require O(N*N) memory, and your previous comment "if we allow..." is moot. – lenik Nov 27 '12 at 05:24
  • @lenik I wrote my own algorithm and also a comparison of yours, the original, and the original in C. – hynekcer Dec 04 '12 at 01:32