0

I am trying to calculate the mean, mode and median for 2 columns in a dataframe. I want to groupby COL1 and calculate the mean, mode and median for COL2. There are numerous columns and rows but an example of the 2 I wish to use are shown below below

    COL1      COL2    
1   (0-9)      64  
2   (10-19)    74   
3   (20-29)    64
4   (0-9)      62

I have been trying code as below but continually get errors:

data.groupby('COL1')['COL2'].mean()

If I do a groupby with just COL1 this returns the mean (and median but not mode) for 2 other columns fine but not COL2:

data1 = data.groupby(['COL1']).mean()

I think there may be an issue with COL2 showing as an object but I am having difficulty in fixing this. An example error is shown below:

pandas.core.base.DataError: No numeric types to aggregate

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
Stealth
  • 63
  • 11

1 Answers1

2

You can set the type to int and then use pd.Series.mode to compute mode (along with mean and median)

df = df[df.notna()] #Handling nans
df['COL2'] = df['COL2'].astype(float)

mode = pd.Series.mode

#INCASE OF ValueError: Must produce aggregated value, Use this for mode
#mode = lambda x: pd.Series.mean(pd.Series.mode(x))

out = df.groupby("COL1")["COL2"].agg(['mean','median', mode])
print(out)

In case a group doesn't have a single mode, you can take the mean of the modes returned by the pd.Series.mode. Simply replace the mode with the commented lambda function.

The mean of a single value will not change the mode. Feel free to change the function to what your definition of mode would be incased a single mode is not found.

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
  • Thanks but the first line of that is causing an error, shown below. That caused me to look again at the CSV this data originally comes from I've noticed there is some blank data. I tried filling this in with null and when it is both blank and null I get the following: ValueError: invalid literal for int() with base 10: 'null' – Stealth Jan 08 '21 at 18:59
  • You have null values in data. Add df = df.notna() before the first line. – Akshay Sehgal Jan 08 '21 at 19:05
  • I have tried that but now all my other columns are now returning a bool, so COL1 now shows: False True When it should be: COL1 (0-9) (10-19) (20-29) (0-9) – Stealth Jan 08 '21 at 20:42
  • Updated my answer please check. Also please read [pandas documentation](https://pandas.pydata.org/docs/) for knowing how to debug code. thanks! – Akshay Sehgal Jan 08 '21 at 20:46
  • Do mark the answer as correct if it helped you solve the question, this encourages me to put in time to helping you with future problems. – Akshay Sehgal Jan 08 '21 at 20:47
  • Sorry I really appreciate your help but the updated answer does not work for me and I am now again getting: ValueError: invalid literal for int() with base 10: 'null' – Stealth Jan 08 '21 at 21:21
  • Try now, changed int to float. You can just google the problem to get a solution for the Valueerror. check this [link](https://stackoverflow.com/questions/1841565/valueerror-invalid-literal-for-int-with-base-10) – Akshay Sehgal Jan 08 '21 at 21:23
  • Thanks for all of your help Akshay. Although I still cant get this working in my situation I have tested this with a subset of my data and your solution works. I am obviously having deeper problems with the data and csv that I need to solve. Thanks again. – Stealth Jan 09 '21 at 16:04
  • 1
    No worries. Try to write another question around those issues you are facing. Maybe more people can help solve them as well. Thanks ! – Akshay Sehgal Jan 09 '21 at 16:14