I've googled, I've tested, and this has me at my wits end. I have a list of numbers I need to group by similarity. For instance, in a list of [1, 6, 9, 100, 102, 105, 109, 134, 139], 1 6 9 would be put into a list, 100, 102, 105, and 109 would be put into a list, and 134 and 139. I'm terrible at math, and I've tried and tried this, but I can't get it to work. To be explicit as possible, I wish to group numbers that are within 10 values away from one another. Can anyone help? Thanks.
-
4You'll need to define "similarity" more precisely. Do you mean, have the same hundreds and tens digits? – Ned Batchelder Feb 09 '13 at 01:26
-
I mean, digits that are within 10(or however many) values of each other. Sorry, tried to put this as explicitly as possible. – Adam Magyar Feb 09 '13 at 01:28
-
1What if possible groups overlap? – millimoose Feb 09 '13 at 01:30
-
4Suppose you have `[56, 65, 66, 67]`. What would the groups be? – kojiro Feb 09 '13 at 01:32
3 Answers
There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:
def cluster(data, maxgap):
'''Arrange data into groups where successive elements
differ by no more than *maxgap*
>>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
[[1, 6, 9], [100, 102, 105, 109], [134, 139]]
>>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
[[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]
'''
data.sort()
groups = [[data[0]]]
for x in data[1:]:
if abs(x - groups[-1][-1]) <= maxgap:
groups[-1].append(x)
else:
groups.append([x])
return groups
if __name__ == '__main__':
import doctest
print(doctest.testmod())

- 216,523
- 63
- 388
- 485
This will find the groups:
nums = [1, 6, 9, 100, 102, 105, 109, 134, 139]
for k, g in itertools.groupby(nums, key=lambda n: n//10):
print k, list(g)
0 [1, 6, 9]
10 [100, 102, 105, 109]
13 [134, 139]
Note that if nums isn't actually sorted as your sample shows, you'll need to sort it first.

- 128,818
- 30
- 231
- 230

- 364,293
- 75
- 561
- 662
-
7The only thing I don't like about this approach is that ``[1, 6, 9, 99, 100, 134, 139]`` would group the *99* and *100* into different groups. It would be better to compute the differences between successive data points to determine where one cluster begins and the other ends. – Raymond Hettinger Feb 09 '13 at 01:39
-
1yeah unfortunately that's what happened when I tried this code ;/. Almost perfect. – Adam Magyar Feb 09 '13 at 01:44
-
1
-
You can make this work… see my answer. But I'm not sure it's what you want in this case. (I'd still write it as a sequence of iterator transformations, just not `groupby`.) – abarnert Mar 19 '13 at 04:33
First, you can easily convert any sequence into a sequence of pairs of adjacent items. Just tee it, shift it forward, and zip the unshifted and unshifted copies. The only trick is that you need to start with either (<something>, 1)
or (139, <something>)
, because in this case we want not each pair of elements, but a pair for each element:
def pairify(it):
it0, it1 = itertools.tee(it, 2)
first = next(it0)
return zip(itertools.chain([first, first], it0), it1)
(This isn't the simplest way to write it, but I think this may be the way that's most readable to people who aren't familiar with itertools
.)
>>> a = [1, 6, 9, 100, 102, 105, 109, 134, 139]
>>> list(pairify(a))
[(1, 1), (1, 6), (6, 9), (9, 100), (100, 102), (102, 105), (105, 109), (109, 134), (134, 139)]
Then, with a slightly more complicated version of Ned Batchelder's key, you can just use groupby
.
However, I think in this case this will end up being more complicated than an explicit generator that does the same thing.
def cluster(sequence, maxgap):
batch = []
for prev, val in pairify(sequence):
if val - prev >= maxgap:
yield batch
batch = []
else:
batch.append(val)
if batch:
yield batch

- 354,177
- 51
- 601
- 671