Grouping / clustering numbers in Python

Question

I've googled, I've tested, and this has me at my wits end. I have a list of numbers I need to group by similarity. For instance, in a list of [1, 6, 9, 100, 102, 105, 109, 134, 139], 1 6 9 would be put into a list, 100, 102, 105, and 109 would be put into a list, and 134 and 139. I'm terrible at math, and I've tried and tried this, but I can't get it to work. To be explicit as possible, I wish to group numbers that are within 10 values away from one another. Can anyone help? Thanks.

You'll need to define "similarity" more precisely. Do you mean, have the same hundreds and tens digits? — Ned Batchelder, Feb 09 '13 at 01:26
I mean, digits that are within 10(or however many) values of each other. Sorry, tried to put this as explicitly as possible. — Adam Magyar, Feb 09 '13 at 01:28
Suppose you have `[56, 65, 66, 67]`. What would the groups be? — kojiro, Feb 09 '13 at 01:32

Raymond Hettinger · Accepted Answer · 2013-02-09T01:40:55.197

There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())

score 8 · Answer 2 · edited Mar 19 '13 at 02:58

8

This will find the groups:

nums = [1, 6, 9, 100, 102, 105, 109, 134, 139]
for k, g in itertools.groupby(nums, key=lambda n: n//10):
    print k, list(g)

0 [1, 6, 9]
10 [100, 102, 105, 109]
13 [134, 139]

Note that if nums isn't actually sorted as your sample shows, you'll need to sort it first.

edited Mar 19 '13 at 02:58

jamylak

128,818
30
231
230

answered Feb 09 '13 at 01:29

Ned Batchelder

364,293
75
561
662

7

The only thing I don't like about this approach is that ``[1, 6, 9, 99, 100, 134, 139]`` would group the *99* and *100* into different groups. It would be better to compute the differences between successive data points to determine where one cluster begins and the other ends. – Raymond Hettinger Feb 09 '13 at 01:39
1

yeah unfortunately that's what happened when I tried this code ;/. Almost perfect. – Adam Magyar Feb 09 '13 at 01:44
1

Yes, it was underspecified when I wrote it. – Ned Batchelder Feb 09 '13 at 02:12
You can make this work… see my answer. But I'm not sure it's what you want in this case. (I'd still write it as a sequence of iterator transformations, just not `groupby`.) – abarnert Mar 19 '13 at 04:33

score 3 · Answer 3 · answered Mar 19 '13 at 04:30

First, you can easily convert any sequence into a sequence of pairs of adjacent items. Just tee it, shift it forward, and zip the unshifted and unshifted copies. The only trick is that you need to start with either (<something>, 1) or (139, <something>), because in this case we want not each pair of elements, but a pair for each element:

def pairify(it):
    it0, it1 = itertools.tee(it, 2)
    first = next(it0)
    return zip(itertools.chain([first, first], it0), it1)

(This isn't the simplest way to write it, but I think this may be the way that's most readable to people who aren't familiar with itertools.)

>>> a = [1, 6, 9, 100, 102, 105, 109, 134, 139]
>>> list(pairify(a))
[(1, 1), (1, 6), (6, 9), (9, 100), (100, 102), (102, 105), (105, 109), (109, 134), (134, 139)]

Then, with a slightly more complicated version of Ned Batchelder's key, you can just use groupby.

However, I think in this case this will end up being more complicated than an explicit generator that does the same thing.

def cluster(sequence, maxgap):
    batch = []
    for prev, val in pairify(sequence):
        if val - prev >= maxgap:
            yield batch
            batch = []
        else:
            batch.append(val)
    if batch:
        yield batch

Grouping / clustering numbers in Python

3 Answers3

Linked

Related