0

I'm looking to fill in missing values of one column with the mode of the value from another column. Let's say this is our data set (borrowed from Chris Albon):

import pandas as pd
import numpy as np

raw_data = {'first_name': ['Jake', 'Jake', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze'], 
        'age': [42, np.nan, 36, 24, 73], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

I know we can fill in missing postTestScore with each sex's mean value of postTestScore with:

df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True) df

But how would we fill in missing sex with each first name's mode value of sex (obviously this is not politically correct, but as an example this was an easy data set to use). So for this example the missing sex value would be 'm' because there are two Jake's with the value 'm'. If there were a Jake with value 'f' it would still pick 'm' as the mode value because 2 > 1. It would be nice if you could do:

df["sex"].fillna(df.groupby("first_name")["sex"].transform("mode"), inplace=True) df

I looked into value_counts and apply but couldn't find this specific case. My ultimate goal is to be able to look at one column and if that doesn't have a mode value then to look at another column for a mode value.

d84_n1nj4
  • 1,712
  • 6
  • 23
  • 40

1 Answers1

1

You need call the mode function with pd.Series.mode

df.groupby("first_name")["sex"].transform(pd.Series.mode)
Out[432]: 
0    m
1    m
2    f
3    m
4    f
Name: sex, dtype: object
BENY
  • 317,841
  • 20
  • 164
  • 234
  • How would I use this with `fillna` so that I can look at the `NaN`'s that are left and try to fill them in with say 'last_name' if possible. – d84_n1nj4 Jun 01 '18 at 03:00
  • I'm using code like this but for a larger data frame; `df["sex"].fillna(df.groupby("first_name")["sex"].transform(pd.Series.mode), inplace = True)` and `df["sex"].fillna(df.groupby("last_name")["sex"].transform(pd.Series.mode), inplace = True)` but I get a warning `SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame` – d84_n1nj4 Jun 01 '18 at 03:11
  • @d84_n1nj4 https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – BENY Jun 01 '18 at 03:20
  • It seems that I can ignore the warning for what I'm trying to achieve, correct? – d84_n1nj4 Jun 01 '18 at 11:51
  • @d84_n1nj4 yes :-) – BENY Jun 01 '18 at 12:27
  • I upgraded my conda packages and now I receive this error: `ValueError: Length of passed values is 1, index implies 3` Any ideas? – d84_n1nj4 Jun 02 '18 at 21:02
  • Agreed, but now I wonder if the answer provided is correct or there is a bug from the upgrade. – d84_n1nj4 Jun 02 '18 at 21:11
  • @d84_n1nj4 it should be correct , when you do subset df, you can try do with df1=df.loc[].copy() – BENY Jun 02 '18 at 21:49
  • I created a new question if you want to look at it: https://stackoverflow.com/questions/50661270/python-pandas-series-value-error-length-of-passed-values-is-x-index-implies-y#50661332 – d84_n1nj4 Jun 02 '18 at 21:51