
I have a long list of values (a shortened version is shown below) that I need to count:

ed = [ 0.52309  ,  3.1443  , 16.5789  , 24.0643  ,  9.70981 ,  1.71983 ,
       16.3453  , 14.1901  , 22.0353  ,  1.71983 , 15.0469  , 13.98    ,
       11.4753  , 32.7859  ,  9.7098  ,  6.36272 ,  3.2058  ,  1.46917 ,
        6.36271 , 11.5869  ,  1.72052 ,  6.32043 ,  1.72052 ,  1.72052 ,
        5.37679 ,  3.15279 ,  9.70979 ,  1.72052 ,  3.44035 ,  2.15729 ,
       12.0049  ]

and that I count with:

from collections import Counter

cnt = Counter(ed)
edlist = [list(i) for i in cnt.items()]
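A self-contained version of the counting step, run on the shortened list above (the counts are of course smaller than in the full list):

```python
from collections import Counter

# shortened list from above
ed = [0.52309, 3.1443, 16.5789, 24.0643, 9.70981, 1.71983,
      16.3453, 14.1901, 22.0353, 1.71983, 15.0469, 13.98,
      11.4753, 32.7859, 9.7098, 6.36272, 3.2058, 1.46917,
      6.36271, 11.5869, 1.72052, 6.32043, 1.72052, 1.72052,
      5.37679, 3.15279, 9.70979, 1.72052, 3.44035, 2.15729,
      12.0049]

cnt = Counter(ed)
edlist = [list(i) for i in cnt.items()]

# the two most frequent values in the shortened list
print(cnt.most_common(2))  # [(1.72052, 4), (1.71983, 2)]
```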

The list I obtain contains, among the others, some very similar values

[[1.72052, 60], [1.71983, 34], [6.36271, 16], [9.7098, 14], [9.70979, 5], [0.52309, 3], [9.70981, 3]]

that I would like to add together within a given tolerance. For example:

9.7098 has 14 counts
9.70981 has 3 counts
9.70979 has 5 counts

I would like to add all of them to the item with the highest count, but I am not sure whether there is a function that can do this within some absolute or relative tolerance. What I would like to obtain is:

[[1.72052, 60], [1.71983, 34], [6.36271, 16], [9.7098, 22], [0.52309, 3]]

I have read the questions about grouping and clustering, but I do not know how to apply them here. I need to count the values within some given tolerance while keeping track of how many times each one has been found.

saimon
  • `1.72052` and `1.71983` are also close values, why aren't they added? what's the threshold? – RomanPerekhrest Jan 16 '23 at 08:51
  • You need grouping/clustering. Then you can pick from a large number of similar questions: https://stackoverflow.com/questions/14783947 https://stackoverflow.com/questions/15800895 https://stackoverflow.com/questions/18364026 https://stackoverflow.com/questions/11513484 https://stackoverflow.com/questions/35094454 https://stackoverflow.com/questions/7869609 https://stackoverflow.com/questions/65425379 https://stackoverflow.com/questions/67240666 – tevemadar Jan 16 '23 at 08:55
  • How close is close? – DarkKnight Jan 16 '23 at 08:56
  • @RomanPerekhrest I want that to be a parameter to adjust as a relative error (5%, 1%) or an absolute threshold (0.001, 0.00005) – saimon Jan 16 '23 at 09:08
  • You can use the clustering in all those links to group together values you consider similar and then count each group... How does that not answer your problem? – Tomerikoo Jan 16 '23 at 09:30
  • @Tomerikoo because if I group data what I obtain is a group of 3 elements: [9.70979, 9.7098, 9.70981], and I lose information on the total number of counts. I guess I can count the occurrence of elements in each group in the original list, but I was wondering if there is a more elegant/efficient way to do that – saimon Jan 16 '23 at 09:36
  • `len([9.70979, 9.7098, 9.70981])` is 3...... – Tomerikoo Jan 16 '23 at 09:37
  • You can even save time and group after counting. You group by the key, and add the counts while doing it – Tomerikoo Jan 16 '23 at 09:37
  • @Tomerikoo exactly what I want to do, but I don't know how to do either of the two operations -.-" – saimon Jan 16 '23 at 09:40

1 Answer


You can cluster the counts according to their key using `itertools.groupby`. To do that, you will have to sort the list first.

Then, sum the counts of each group and add it to the final list:

from itertools import groupby

l = [[1.72052, 60], [1.71983, 34], [6.36271, 16], [9.7098, 14], [9.70979, 5], [0.52309, 3], [9.70981, 3]]
l.sort(key=lambda x: x[0])

tolerance = 0.001

res = []
for key, group in groupby(l, lambda x: int(x[0]*(1/tolerance))):
    # for example: key = 9709, group = [[9.70979, 5], [9.7098, 14], [9.70981, 3]]
    group = list(group)
    res.append([max(group, key=lambda x: x[1])[0], sum(x[1] for x in group)])

print(res)

It is mostly a matter of choosing the right `lambda`s, using either the value or the count as the key for the different functions.
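To see what each step contributes, here is the same approach as a runnable sketch with a print added to show each bucket before it is collapsed:

```python
from itertools import groupby

l = [[1.72052, 60], [1.71983, 34], [6.36271, 16], [9.7098, 14],
     [9.70979, 5], [0.52309, 3], [9.70981, 3]]
l.sort(key=lambda x: x[0])          # sort by the value, not the count

tolerance = 0.001
bucket = lambda x: int(x[0] * (1 / tolerance))  # e.g. 9.7098 -> 9709

res = []
for key, group in groupby(l, bucket):
    group = list(group)
    print(key, group)               # inspect each bucket
    winner = max(group, key=lambda x: x[1])[0]  # value with the most counts
    res.append([winner, sum(cnt for _, cnt in group)])

print(res)
# [[0.52309, 3], [1.71983, 34], [1.72052, 60], [6.36271, 16], [9.7098, 22]]
```

The three values near `9.7098` all map to bucket `9709`, so their counts (14 + 5 + 3 = 22) are attached to `9.7098`, the value with the most counts.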


Alternatively, you could cluster the data itself (not the counts), in which case the count is simply the size of each group:

from itertools import groupby

l = [0.52309, 3.1443, 16.5789, 24.0643, 9.70981, ...]
l.sort()

tolerance = 0.001

res = []
for key, group in groupby(l, lambda x: int(x*(1/tolerance))):
    res.append([key*tolerance, len(list(group))])

print(res)

In this case, as we can't know which number has the most counts, the key is simply the number normalized according to the tolerance.
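One caveat worth being aware of (consistent with the desired output above, where `1.71983` and `1.72052` stay separate): the `int(...)` bucketing merges values that fall into the same fixed bucket, not values within `tolerance` of each other, so two values closer than the tolerance can still land in different groups if they straddle a bucket boundary. A minimal sketch:

```python
tolerance = 0.001
bucket = lambda x: int(x * (1 / tolerance))

a, b = 1.71983, 1.72052
print(abs(a - b) < tolerance)   # True: closer than the tolerance...
print(bucket(a), bucket(b))     # 1719 1720: ...but in different buckets
```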

Tomerikoo
  • This worked perfectly after some minor edits. It is a bit of a black box to me, so I guess I need to study the lambda function to be able to exploit it – saimon Jan 16 '23 at 10:39
  • @saimon This is why I added the comment showing the "insides" of `groupby`. It can be a little confusing at the start. I suggest adding some print lines to understand the output of `groupby`, and starting with simpler 1-D lists to understand the concept. Then, the `lambda`s are just a way to tell the different functions which element to use as the indicator: the sorting is done according to the first element, the grouping as well but with a calculation, the max gives the list with the maximum second element but takes the first, and the sum is over all the second elements – Tomerikoo Jan 16 '23 at 11:04