
I am working on a kMeans AI to determine the season of any given day. For this, I have an array of data with 4 columns. This is what it looks like (though it's longer):

['0.2742330338168506' '0' '1.3694492732480696' 'winter']
['0.28529288153011745' '0' '1.3805091209613365' 'lente']
['0.28595917620794253' '1' '1.3811754156391616' 'winter']
['0.2874392369724381' '2' '1.3826554764036572' 'lente']
['0.316557712713994' '2' '1.411773952145213' 'herfst']
['0.32113534393276466' '3' '1.4163515833639837' 'lente']
['0.3231108855082745' '3' '1.4220488660040091' 'lente']
['0.3163219663513872' '3' '1.4288377851608964' 'winter']
['0.31201423701381703' '4' '1.4331455144984666' 'lente']
['0.3081781460867783' '4' '1.4369816054255053' 'lente']
['0.29534720251567403' '4' '1.4498125489966096' 'winter']

Now I know how to find the most common item in the entire array, like so:

Counter(array.flat).most_common()

But here I need the most common item in the 4th column per cluster, where the cluster is the value in the second column. Is there an easier way to do this than writing a long for loop and counting them all?

nik47
  • try putting it in pandas df with columns ('a', 'b', 'c', 'd') then run: `import pandas as pd;df.groupby('b')['d'].agg(pd.Series.mode)` – Matt Feb 18 '21 at 10:02
  • @Matt Have a look at this on why mode() doesn't work with groupby(): https://github.com/pandas-dev/pandas/issues/13809 – Ishwar Venugopal Feb 18 '21 at 10:06
  • that's interesting, maybe just change it to a mode function that will work then: `from scipy.stats import mode;df.groupby('b')['d'].apply(mode)` – Matt Feb 18 '21 at 11:34

1 Answer

For some reason the solution suggested in the comments throws a ValueError. So here's an alternate solution using pandas:

import pandas as pd

data = []  # nested list holding the rows shown in your question
df = pd.DataFrame(data, columns=['val1', 'cluster', 'val2', 'season'])  # read your data into a DataFrame

def print_mode(group):
    # Print the cluster label, then its most frequent season(s); ties yield several values
    print("{} - {}".format(group['cluster'].values[0], group['season'].mode().values))

df.groupby('cluster').apply(print_mode)

Sample output for your example data would be:

0 - ['lente' 'winter']
1 - ['winter']
2 - ['herfst' 'lente']
3 - ['lente']
4 - ['lente']

Instead of printing it, you can use it however you wish depending on your use-case.
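
For example, here is a minimal sketch that collects the modes into a dictionary instead of printing them. It assumes the rows from the question are stored as strings in a nested list; the name `modes_per_cluster` is only illustrative.

```python
import pandas as pd

# Rows copied from the question: [val1, cluster, val2, season]
data = [
    ['0.2742330338168506', '0', '1.3694492732480696', 'winter'],
    ['0.28529288153011745', '0', '1.3805091209613365', 'lente'],
    ['0.28595917620794253', '1', '1.3811754156391616', 'winter'],
    ['0.2874392369724381', '2', '1.3826554764036572', 'lente'],
    ['0.316557712713994', '2', '1.411773952145213', 'herfst'],
    ['0.32113534393276466', '3', '1.4163515833639837', 'lente'],
    ['0.3231108855082745', '3', '1.4220488660040091', 'lente'],
    ['0.3163219663513872', '3', '1.4288377851608964', 'winter'],
    ['0.31201423701381703', '4', '1.4331455144984666', 'lente'],
    ['0.3081781460867783', '4', '1.4369816054255053', 'lente'],
    ['0.29534720251567403', '4', '1.4498125489966096', 'winter'],
]
df = pd.DataFrame(data, columns=['val1', 'cluster', 'val2', 'season'])

# Map each cluster label to the list of its most frequent seasons.
# mode() keeps ties, so a cluster can map to more than one season.
modes_per_cluster = {
    cluster: seasons.mode().tolist()
    for cluster, seasons in df.groupby('cluster')['season']
}
print(modes_per_cluster)
```

Clusters with a tie (here `'0'` and `'2'`) map to several seasons, so you may want to decide how to break ties before using the result as a cluster label.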

Ishwar Venugopal
  • Thanks =D This works great, and it also helped me find a numpy solution: https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function/38015063 – Floris Fancypants Feb 18 '21 at 11:43
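
For readers who want to stay in NumPy, one possible sketch (not necessarily the approach from the linked answer) uses `np.unique` with `return_counts=True` on each cluster's slice. The arrays `clusters` and `seasons` are assumed to be the second and fourth columns of the question's array, and `most_common` is an illustrative name.

```python
import numpy as np

# Second and fourth columns of the array from the question (assumed layout).
clusters = np.array(['0', '0', '1', '2', '2', '3', '3', '3', '4', '4', '4'])
seasons = np.array(['winter', 'lente', 'winter', 'lente', 'herfst',
                    'lente', 'lente', 'winter', 'lente', 'lente', 'winter'])

most_common = {}
for cluster in np.unique(clusters):
    # Count how often each season occurs among the rows of this cluster.
    values, counts = np.unique(seasons[clusters == cluster], return_counts=True)
    # argmax returns the first maximum, so ties resolve to the
    # alphabetically first season (np.unique returns sorted values).
    most_common[str(cluster)] = str(values[np.argmax(counts)])

print(most_common)
```

Unlike `mode()` in pandas, this keeps a single season per cluster, so tied clusters silently pick the alphabetically first one.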