I have a CSV file containing thousands of floating-point values arranged in ascending order. I want to group those values into suitable clusters.

For example:
0.001
0.002
0.013
0.1
0.101
0.12
0.123
0.112
0.113
0.2

The clusters should then look like this:

0 - 0.1 with count 4
0.1 - 0.2 with count 6

How can I do this clustering automatically in Python? Do I need to supply some initial parameters?

POOJA GUPTA
  • Did you try this? http://stackoverflow.com/questions/14783947/grouping-clustering-numbers-in-python... – jsh Oct 03 '15 at 15:54
  • How high do the values go? – Padraic Cunningham Oct 03 '15 at 17:50
  • @PadraicCunningham: Sorry for the late reply. It is dynamic: the values can go to any level, but some ranges such as 0.1 and 0.2 will always be there. I wanted some way for it to decide these ranges by itself. :) I will try relating your answer to what I want. Thank you so much for your early response. If you can tell me whether what I want is possible, please let me know. :) – POOJA GUPTA Oct 04 '15 at 16:14
  • @POOJAGUPTA, it really depends on what the increments can be. Say we get to 1.0, 10.0 or 100.0: what should the steps be after that? – Padraic Cunningham Oct 04 '15 at 16:18

1 Answer

You can use bisect.bisect_left to find where each value would land in a sorted list of keys (the upper bounds of the bins, spaced at whatever increment you need), then use that index to look up the key and increment its count in a dict.

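from bisect import bisect_left

# Upper bounds of the bins, in ascending order.
keys = [0.1, 0.2]
d = dict.fromkeys(keys, 0)
with open("test.txt") as f:
    for line in f:
        # Index of the first key >= the value, i.e. the bin's upper bound.
        ind = bisect_left(keys, float(line))
        # Note: raises IndexError if a value is larger than keys[-1].
        d[keys[ind]] += 1
print(d)
# Output for the sample data in the question:
# {0.1: 4, 0.2: 6}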
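Since the comments mention that the range of values is dynamic, the same bisect_left idea can be sketched with keys generated from the data instead of hard-coded. This is only a minimal sketch under assumptions that are not in the answer: the function name bin_counts and the step parameter are hypothetical, the values are non-negative, and a fixed increment (0.1 by default) is acceptable.

```python
from bisect import bisect_left


def bin_counts(values, step=0.1):
    """Count values per bin, where each key is the bin's upper bound.

    bin_counts and step are hypothetical names, not part of the answer.
    Assumes non-negative values and a positive step.
    """
    values = list(values)
    if not values:
        return {}
    top = max(values)
    # Generate upper bounds step, 2*step, ... until the largest value is covered.
    # round() trims float noise such as 3 * 0.1 == 0.30000000000000004.
    keys = []
    i = 1
    while not keys or keys[-1] < top:
        keys.append(round(i * step, 10))
        i += 1
    counts = dict.fromkeys(keys, 0)
    for v in values:
        # The last key is >= every value, so the index is always in range.
        counts[keys[bisect_left(keys, v)]] += 1
    return counts


sample = [0.001, 0.002, 0.013, 0.1, 0.101, 0.12, 0.123, 0.112, 0.113, 0.2]
print(bin_counts(sample))
```

For the sample data in the question this prints {0.1: 4, 0.2: 6}. Because the keys always extend to the largest value, no lookup can fall past the end of the list. The step still has to be chosen up front, so this does not decide the increments by itself.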

Another way would be to round by an appropriate amount:

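with open("test.txt") as f:
    keys = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    d = dict.fromkeys(keys, 0)
    for flt in map(float, f):
        if flt in d:
            # A value exactly equal to a key belongs to that key's own bin.
            d[flt] += 1
        else:
            # Otherwise round up to the next multiple of 0.1 (minimum 0.1).
            # Note: raises KeyError if the result is not in keys (values > 0.6).
            k = round(flt + .05, 1) if flt > .05 else .1
            d[k] += 1
print(d)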
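The rounding idea can also be written without a predefined key list by computing each bin's upper bound arithmetically and counting with collections.Counter. This is a hedged sketch, not the answer's own code: ceil_bin_counts and step are hypothetical names, and it swaps the round(flt + .05, 1) trick for math.ceil on the quotient.

```python
import math
from collections import Counter


def ceil_bin_counts(values, step=0.1):
    """Map each value to the upper bound of its bin and count the bins.

    ceil_bin_counts and step are hypothetical names, not part of the answer.
    Values at or below zero fall into the first bin, mirroring the
    'else .1' branch of the rounding code above.
    """
    counts = Counter()
    for v in values:
        # Round the quotient first so that float noise such as
        # 1.1 / 0.1 == 11.000000000000002 is not pushed up to 12.
        n = max(1, math.ceil(round(v / step, 9)))
        # Round the product so that keys compare equal to literals like 0.3.
        counts[round(n * step, 10)] += 1
    return counts


sample = [0.001, 0.002, 0.013, 0.1, 0.101, 0.12, 0.123, 0.112, 0.113, 0.2]
print(dict(ceil_bin_counts(sample)))
```

A Counter creates keys on demand, so values beyond 0.6 get their own bins instead of raising KeyError. The trade-off is that empty bins do not appear in the result.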
Padraic Cunningham