Identify groups of varying continuous numbers in a list

Question

In this other SO post, a Python user asked how to group continuous numbers such that any sequences could just be represented by its start/end and any stragglers would be displayed as single items. The accepted answer works brilliantly for continuous sequences.

I need to be able to adapt a similar solution but for a sequence of numbers that have potentially (not always) varying increments. Ideally, how I represent that will also include the increment (so they'll know if it was every 3, 4, 5, nth)

Referencing the original question, the user asked for the following input/output

[2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 20]  # input
[(2,5), (12,17), 20]

What I would like is the following (Note: I wrote a tuple as the output for clarity but xrange would be preferred using its step variable):

[2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 20]  # input
[(2,5,1), (12,17,1), 20]  # note, the last element in the tuple would be the step value

And it could also handle the following input

[2, 4, 6, 8, 12, 13, 14, 15, 16, 17, 20]  # input
[(2,8,2), (12,17,1), 20]  # note, the last element in the tuple would be the increment

I know that xrange() supports a step so it may be possible to even use a variant of the other user's answer. I tried making some edits based on what they wrote in the explanation but I wasn't able to get the result I was looking for.

For anyone that doesn't want to click the original link, the code that was originally posted by Nadia Alramli is:

ranges = []
for key, group in groupby(enumerate(data), lambda (index, item): index - item):
    group = map(itemgetter(1), group)
    if len(group) > 1:
        ranges.append(xrange(group[0], group[-1]))
    else:
        ranges.append(group[0])

Moses Koledoye · Answer 1 · 2016-09-27T16:51:46.707

The itertools pairwise recipe is one way to solve the problem. Applied with itertools.groupby, groups of pairs whose mathematical difference are equivalent can be created. The first and last items of each group are then selected for multi-item groups or the last item is selected for singleton groups:

from itertools import groupby, tee, izip


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def grouper(lst):
    result = []
    for k, g in groupby(pairwise(lst), key=lambda x: x[1] - x[0]):
        g  = list(g)
        if len(g) > 1:
            try:
                if g[0][0] == result[-1]:
                    del result[-1]
                elif g[0][0] == result[-1][1]:
                    g = g[1:] # patch for duplicate start and/or end
            except (IndexError, TypeError):
                pass
            result.append((g[0][0], g[-1][-1], k))
        else:
            result.append(g[0][-1]) if result else result.append(g[0])
    return result

Trial: input -> grouper(lst) -> output

Input: [2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 20]
Output: [(2, 5, 1), (12, 17, 1), 20]

Input: [2, 4, 6, 8, 12, 13, 14, 15, 16, 17, 20]
Output: [(2, 8, 2), (12, 17, 1), 20]

Input: [2, 4, 6, 8, 12, 12.4, 12.9, 13, 14, 15, 16, 17, 20]
Output: [(2, 8, 2), 12, 12.4, 12.9, (13, 17, 1), 20] # 12 does not appear in the second group

Update: (patch for duplicate start and/or end values)

s1 = [i + 10 for i in xrange(0, 11, 2)]; s2 = [30]; s3 = [i + 40 for i in xrange(45)]

Input: s1+s2+s3
Output: [(10, 20, 2), (30, 40, 10), (41, 84, 1)]

# to make 30 appear as an entry instead of a group change main if condition to len(g) > 2
Input: s1+s2+s3
Output: [(10, 20, 2), 30, (41, 84, 1)]

Input: [2, 4, 6, 8, 10, 12, 13, 14, 15, 16, 17, 20]
Output: [(2, 12, 2), (13, 17, 1), 20]

In the case of the following sequence, s1 = [i + 10 for i in xrange(0, 11, 2)]; s2 = [30]; s3 = [i + 40 for i in xrange(45)]; grouper(s1 + s2 + s3) grouper() produces duplicate start-end handles, [(10, 20, 2), (20, 40, 10), (40, 84, 1)]. Notice that 20 and 40 are listed twice. See my comment to Padraic for more info on that. Do you have a solution to keep duplicates like that from appearing? — ColinKennedy, Sep 27 '16 at 14:59
@user3626104 Good edge case you've got. I've updated my answer with a patch that fixes that. — Moses Koledoye, Sep 27 '16 at 16:01
I got one more for you, if you're up for it. Outliers that begin sequences are excluded from your results s1 = [4, 8, 10, 12, 14] grouper(s1) # (8, 14, 2) - the 4 is missing. If you need another example - s1 = [3, 4, 8, 10, 12, 14] grouper(s1) # (4, (8, 14, 2)) - the 3 is missing — ColinKennedy, Sep 27 '16 at 16:02
That case is not well defined. For `[4, 8, 10, 12, 14] -> [(4, 8), (10, 14, 2)]`, 4 won't stand alone as the first group will always be valid, the code is not clairvoyant as to detect a future pattern before grouping 4 and 8. — Moses Koledoye, Sep 27 '16 at 16:22
If you looking to not put 30 in a group for the first test case, change `if len(g) > 1 -> if len(g) > 2` — Moses Koledoye, Sep 27 '16 at 16:39

score 2 · Answer 2 · answered Sep 26 '16 at 18:43

Here is a quickly written (and extremely ugly) answer:

def test(inArr):
    arr=inArr[:] #copy, unnecessary if we use index in a smart way
    result = []
    while len(arr)>1: #as long as there can be an arithmetic progression
        x=[arr[0],arr[1]] #take first two
        arr=arr[2:] #remove from array
        step=x[1]-x[0]
        while len(arr)>0 and x[1]+step==arr[0]: #check if the next value in array is part of progression too
            x[1]+=step #add it
            arr=arr[1:]
        result.append((x[0],x[1],step)) #append progression to result
    if len(arr)==1:
        result.append(arr[0])
    return result

print test([2, 4, 6, 8, 12, 13, 14, 15, 16, 17, 20])

This returns [(2, 8, 2), (12, 17, 1), 20]

Slow, as it copies a list and removes elements from it

It only finds complete progressions, and only in sorted arrays.

In short, it is shitty, but should work ;)

There are other (cooler, more pythonic) ways to do this, for example you could convert your list to a set, keep removing two elements, calculate their arithmetic progression and intersect with the set.

You could also reuse the answer you provided to check for certain step sizes. e.g.:

ranges = []
step_size=2
for key, group in groupby(enumerate(data), lambda (index, item): step_size*index - item):
    group = map(itemgetter(1), group)
    if len(group) > 1:
        ranges.append(xrange(group[0], group[-1]))
    else:
        ranges.append(group[0])

Which finds every group with step size of 2, but only those.

Padraic Cunningham · Accepted Answer · 2016-09-27T17:25:41.893

2

You can create an iterator to help grouping and try to pull the next element from the following group which will be the end of the previous group:

def ranges(lst):
    it = iter(lst)
    next(it)  # move to second element for comparison
    grps = groupby(lst, key=lambda x: (x - next(it, -float("inf"))))
    for k, v in grps:
        i = next(v)
        try:
            step = next(v) - i  # catches single element v or gives us a step
            nxt = list(next(grps)[1])
            yield xrange(i, nxt.pop(0), step)
            # outliers or another group
            if nxt:
                yield nxt[0] if len(nxt) == 1 else xrange(nxt[0], next(next(grps)[1]), nxt[1] - nxt[0])
        except StopIteration:
            yield i  # no seq

which give you:

In [2]: l1 = [2, 3, 4, 5, 8, 10, 12, 14, 13, 14, 15, 16, 17, 20, 21]

In [3]: l2 = [2, 4, 6, 8, 12, 13, 14, 15, 16, 17, 20]

In [4]: l3 = [13, 14, 15, 16, 17, 18]

In [5]: s1 = [i + 10 for i in xrange(0, 11, 2)]

In [6]: s2 = [30]

In [7]: s3 = [i + 40 for i in xrange(45)]

In [8]: l4 = s1 + s2 + s3

In [9]: l5 = [1, 2, 5, 6, 9, 10]

In [10]: l6 = {1, 2, 3, 5, 6, 9, 10, 13, 19, 21, 22, 23, 24}

In [11]: 

In [11]: for l in (l1, l2, l3, l4, l5, l6):
   ....:         print(list(ranges(l)))
   ....:     
[xrange(2, 5), xrange(8, 14, 2), xrange(13, 17), 20, 21]
[xrange(2, 8, 2), xrange(12, 17), 20]
[xrange(13, 18)]
[xrange(10, 20, 2), 30, xrange(40, 84)]
[1, 2, 5, 6, 9, 10]
[xrange(1, 3), 5, 6, 9, 10, 13, 19, xrange(21, 24)]

When the step is 1 it is not included in the xrange output.

edited Sep 27 '16 at 17:25

answered Sep 26 '16 at 21:12

Padraic Cunningham

176,452
29
245
321

1

I've found a couple bugs after testing. In cases of varying discontinuous sequential ranges (what a mouthful!) it doesn't yield correct results. s1 = [i + 10 for i in xrange(0, 11, 2)]; s2 = [30]; s3 = [i + 40 for i in xrange(45)]; ranges(s1 + s2 + s3) notice that 30 is completely missing in the output. I assume this is because 30 is halfway between the end and start of the first and second sequence so ranges() doesn't recognize it as an outlier. Setting to 29/31 works as expected, however. – ColinKennedy Sep 27 '16 at 14:51
Will have a look when I get back on my comp. – Padraic Cunningham Sep 27 '16 at 15:23
1

@user3626104, you need to outline what you consider a sequence, how could `(30, 40, 10)` be a valid sequence when there are only two elements? Is `[xrange(10, 20, 2), 30, xrange(40, 84)]` not a more correct solution? – Padraic Cunningham Sep 27 '16 at 16:06
I've been debating with my coworkers about this exact point a little while ago. The long and short of it is that it's not valid. Your representation is more correct. However, no data can be excluded in the output, like in the case of varying discontinuous sequential ranges. (having 1 or more outlier that lies in the middle of two valid sequences). This type of data comes up a lot in our sample sets so it needs to be included. Thank you for letting me clarify that point. A sequence should contain 3 in order to be a trend – ColinKennedy Sep 27 '16 at 16:13
@user3626104, try the edit. Maybe adding a few test samples with expected output would help. – Padraic Cunningham Sep 27 '16 at 16:22
No worries, you're welcome. It was an interesting problem. – Padraic Cunningham Oct 03 '16 at 09:11

score 0 · Answer 4 · answered Jun 10 '19 at 11:54

I came across such a case once. Here it goes.

import more_itertools as mit
iterable = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 20]  # input
x = [list(group) for group in mit.consecutive_groups(iterable)]
output = [(i[0],i[-1]) if len(i)>1 else i[0] for i in x]
print(output)

Identify groups of varying continuous numbers in a list

4 Answers4

Linked

Related