
I have a list of numbers that looks like this:

numbers = [406.82, 406.93, 406.80, 406.89,
           443.22, 443.27, 
           415.01, 415.12, 415.2,
           443.71, 443.83,
           451.05, 451.14]

I want to group them based on how close they are:

numbers_grouped = [[406.82, 406.93, 406.80, 406.89],
                   [443.22, 443.27],
                   [415.01, 415.12, 415.2],
                   [443.71, 443.83],
                   [451.05, 451.14]]

I tried this method, but it doesn't seem to work:

  1. sort the list in ascending order
  2. subtract each number from its neighbouring number
  3. if the difference is less than 0.1, group the numbers together; otherwise don't (a sketch of this is below)
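
In plain Python my attempt looks roughly like this (a rough sketch of the steps above; the group_by_gap name is just for illustration, and 0.1 is the threshold I tried):

def group_by_gap(values, threshold=0.1):
    # sort first, then start a new group whenever the gap to the
    # previous value is not smaller than the threshold
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev < threshold:
            groups[-1].append(cur)  # close enough: same group
        else:
            groups.append([cur])    # gap too large: new group
    return groups

print(group_by_gap(numbers))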

But is there a better method to solve this problem?

Henul
  • Why is 415.01 grouped with 415.12 when it's more than 0.1 apart? Either way, you probably want something like https://stackoverflow.com/a/71678011/3483203 – user3483203 Aug 31 '22 at 14:22
  • Your method is good, but the threshold you use, 0.1, is too arbitrary. You need to find a way to calculate an appropriate threshold to better fit your data. – Stef Aug 31 '22 at 14:23
  • Related: [stackoverflow: Clustering values by their proximity in python](https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning), [pypi: kmeans1d](https://pypi.org/project/kmeans1d/), [stats.stackexchange: How to find the number of clusters in 1d data and the mean of each](https://stats.stackexchange.com/questions/79314/how-to-find-the-number-of-clusters-in-1d-data-and-the-mean-of-each) – Stef Aug 31 '22 at 14:33
  • With your data, any threshold between 0.13 and 0.43 would work. But 0.1 is too small. – Stef Aug 31 '22 at 14:34
  • See also this: [How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?](https://stackoverflow.com/a/35151947/3080723) – Stef Aug 31 '22 at 14:38
  • @user3483203 that's exactly it! – Henul Aug 31 '22 at 16:12

1 Answer


Your method should actually work well. As you've tagged the question with numpy, I assume this is the library you want to use. We can easily find the "boundaries" where we have to cut up the sorted list by using np.diff, and then use np.cumsum to compute the group index of each element. It is not extremely efficient, but it is quite concise. Note that the output is a list of numpy arrays, as numpy arrays cannot be jagged:

import numpy as np
numbers = [406.82, 406.93, 406.80, 406.89,
           443.22, 443.27,
           415.01, 415.12, 415.2,
           443.71, 443.83,
           451.05, 451.14]
numbers.sort()
num = np.array(numbers)
# a gap larger than 0.1 marks a group boundary; the cumulative sum of the
# boundary flags then gives the group index of every element
groups = np.concatenate([[0], np.cumsum(np.diff(num) > 0.1)])
grouped = [num[groups == ind] for ind in range(groups.max() + 1)]  # extract groups
print(grouped)  # list of numpy arrays
flawr
  • This answer uses the same arbitrary threshold 0.1 as the OP, which was actually the source of the issue. For instance, 443.71 and 443.83 won't be grouped together, because 443.83 - 443.71 = 0.12 > 0.1. – Stef Aug 31 '22 at 15:51
  • A possible solution would be to examine the values in `np.diff(num)` and try to automatically determine an appropriate threshold from those values. – Stef Aug 31 '22 at 15:52
  • Thank you, I didn't see that particular point; I interpreted OP's question as a technical one about the implementation. I agree another method might be more suitable for the problem, but without OP providing more information I don't think it makes sense to suggest other approaches. – flawr Aug 31 '22 at 15:56
  • Thank you @flawr for the answer, but this is what I was looking for: https://stackoverflow.com/a/71678011/3483203 – Henul Aug 31 '22 at 16:12
  • This is a direct copy of a linked answer – user3483203 Aug 31 '22 at 16:28
  • @user3483203 I'm sorry, I hadn't seen any comments yet when I opened the question, but I agree it is pretty much *the* most concise way of doing it. Note that, in contrast to the answer you linked, you can use `np.diff` as in my answer; you don't have to reimplement it manually. – flawr Aug 31 '22 at 18:02
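
For reference, the same diff-based split discussed in these comments can also be written with `np.split` instead of the cumsum mask. This is a sketch of that variant (it reuses `numbers` from above and inherits the same 0.1-threshold caveat raised by Stef):

import numpy as np

num = np.sort(np.array(numbers))
# indices where the gap to the previous element exceeds the threshold;
# np.split cuts the sorted array just before each of those positions
boundaries = np.where(np.diff(num) > 0.1)[0] + 1
grouped = np.split(num, boundaries)
print(grouped)  # list of numpy arrays, same grouping as the cumsum version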