5

I'm looking for the most memory efficient way to solve this problem.

I have a list of tuples representing partial string matches in a sentence:

[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]

The first value of each tuple is the start position for the match, the second value is the length.

The idea is to collapse the list so that only the longest continue string match is reported. In this case it would be:

[(0,4), (2,6), (22,6)]

I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.

In case your wondering, I am using a pure python implementation of the Aho-Corasick for matching terms in a static dictionary to the given text snippet.

EDIT: Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.

Thanks!

Community
  • 1
  • 1
Francisco Roque
  • 431
  • 4
  • 14
  • 2
    possible duplicate of [Merging a list of time-range tuples that have overlapping time-ranges](http://stackoverflow.com/questions/5679638/merging-a-list-of-time-range-tuples-that-have-overlapping-time-ranges) – NPE May 28 '12 at 21:16
  • i imagine you'd use an interval tree, merging completely contained intervals on addition, but giving a complete worked answer is more than i can do right now. – andrew cooke May 28 '12 at 22:11
  • 1
    the suggested duplicate merges partial overlaps. i think this question is supposed to merge only proper inclusion (but am guessing). in other words, other answer gives `[(0,4)]` for `[(0,3),(1,4)]` while i guess this question expects both ranges to be returned. – andrew cooke May 28 '12 at 23:24
  • @andrew cooke: The OP wants the list collapsed so that only the longest continuous strings matches remain -- so I don't understand why you think both ranges would be in your two tuple example. – martineau May 29 '12 at 01:20
  • i don't know; it was a guess; i may be wrong. perhaps such things never happen in this case. my main concern is that this does not happen (partial overlap) in the example - that is why i think that maybe it is not wanted. i don't think your argument is convincing because the english is poor and the post confused. it would be good if the poster could clarify before the post is closed. – andrew cooke May 29 '12 at 01:55
  • @aix: that post merges the intervals in a list of lists, here I am searching across the entire list of tuples for values to be merged. – Francisco Roque May 29 '12 at 06:06
  • @andrewcooke: You are completely right, I have not explained myself properly, the result of merging `[(0,3),(1,4)]` should be `[(0,3),(1,4)]`. I will post an example in an edit above. – Francisco Roque May 29 '12 at 06:18

3 Answers3

4
import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
    start, length = lst[i]
    for j in xrange(i+1, len(lst)):
        lstart, llength = lst[j]
        if start >= lstart and start + length <= lstart + llength:
            del lst[i]
            break
print lst
#[(0, 4), (2, 6), (22, 6)]
Janne Karila
  • 24,266
  • 6
  • 53
  • 94
1
a = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
b = [set(xrange(i, i + j)) for i, j in a]
c = b.pop().union(*b)
collapsed = sorted(c)
print collapsed
#Maybe this is useful?:
[0, 1, 2, 3, 22, 23, 24, 25, 26, 27]

#But if you want the requested format, then do this:
d = []
start = collapsed[0]
length = 0
for val in collapsed:
    if start + length < val:
        d.append((start,length))
        start = val
        length = 0
    elif val == collapsed[-1]:
        d.append((start,length + 1))
    length += 1
print d
#Output:
[(0,4), (22,6)]
fraxel
  • 34,470
  • 11
  • 98
  • 102
-1

So, taking you at your word that your main interest is space efficiency, here's one way to do what you want:

lst = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort()
start, length = lst.pop(0)
i = 0
while i < len(lst):
    x, l = lst[i]
    if start + length < x:
        lst[i] = (start, length)
        i += 1
        start, length = x, l
    else:
        length = max(length, x + l - start)
        lst.pop(i)
lst.append((start, length))

This modifies the list in place, never makes the list longer, only uses a small handful of variables to keep state, and only needs one pass through the list

A much faster algorithm is possible if you don't want to modify the list in place - popping items from the middle of a list can be slow, especially if the list is long.

One reasonable optimization would be to keep a list of which indices you're going to remove, and then come back and rebuild the list in a second pass, that way you could rebuild the whole list in one go and avoid the pop overhead. But that would use more memory!

Julian
  • 2,483
  • 20
  • 20
  • this gives `[(0,5)]` for `[(0,3),(1,4)]` which doesn't seem right whichever way you think partial overlaps should be handled. – andrew cooke May 28 '12 at 23:14