6

The problem: let us take Titanic dataset from Kaggle. I have dataframe with columns "Pclass", "Sex" and "Age". I need to fill NaN in column "Age" with a median for certain group. If it is a woman from 1st class, I would like to fill her age with the median for 1st class women, not with the median for whole Age column.

The question is how to make this change in a certain slice?

I tried:

data['Age'][(data['Sex'] == 'female')&(data['Pclass'] == 1)&(data['Age'].isnull())].fillna(median)

where the "median" is my value, but nothing changes "inplace=True" didn't help.

Thanks alot!

  • I add solution for filling `NaN`s by median per group, is necessary only `groupby`. Check edit in my answer. – jezrael Nov 24 '17 at 08:29

3 Answers3

6

I believe you need filter by masks and assign back:

data = pd.DataFrame({'a':list('aaaddd'),
                     'Sex':['female','female','male','female','female','male'],
                     'Pclass':[1,2,1,2,1,1],
                     'Age':[40,20,30,20,np.nan,np.nan]})

print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4   NaN       1  female  d
5   NaN       1    male  d

#boolean mask
mask1 = (data['Sex'] == 'female')&(data['Pclass'] == 1)

#get median by mask without NaNs
med = data.loc[mask1, 'Age'].median()
print (med)
40.0

#repalce NaNs
data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5   NaN       1    male  d

What is same as:

mask2 = mask1 &(data['Age'].isnull())

data.loc[mask2, 'Age'] = med
print (data)
    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5   NaN       1    male  d

EDIT:

If need replace all groups NaNs by median:

data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))
print (data)

    Age  Pclass     Sex  a
0  40.0       1  female  a
1  20.0       2  female  a
2  30.0       1    male  a
3  20.0       2  female  d
4  40.0       1  female  d
5  30.0       1    male  d
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Yes, exactly. Because eklse get NaN. – jezrael Nov 23 '17 at 14:50
  • 1
    You are welcome! Maybe small advice to future - [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) ;) – jezrael Nov 23 '17 at 15:01
  • 1
    I see your question in comment and try answer. `for` loops are possible in `pandas`, but it is slow. `Jeff`, one of main developer of pandas at the moment, post very nice answer - check [here](https://stackoverflow.com/a/24871316/2901002). – jezrael Nov 24 '17 at 08:43
2

I want to add an answer here that is a bit more efficient, as it involves a little less code. Essentially, if you are are using boolean conditions to slice your dataframe and are using .fillna under those specific conditions, just use an assignment:

I will use an example from a different Kaggle competition:

# Use a mask as suggested by jesrael. It's just neater:
mask1 = (test_df.Neighborhood == 'IDOTRR') & (test_df.MSZoning.isna())
mask2 = (test_df.Neighborhood == 'Mitchel') & (test_df.MSZoning.isna())

# Use the mask and assign the desired value 
test_df.loc[mask1, 'MSZoning'] = 'RM'
test_df.loc[mask2, 'MSZoning'] = 'RL'

This is different than jesrael's answer in that he/she uses .fillna() to assign back to the masked dataframe. If you are going to use a mask, and have a specific value in mind, there is no need to use '.fillna()'

rocksNwaves
  • 5,331
  • 4
  • 38
  • 77
1

In case you want to do the same for every groups you can use this trick

data = pd.DataFrame({'a':list('aaaddd'),
                    'Sex':['female','female','male','female','female','male'],
                    'Pclass':[1,2,1,2,1,1],
                    'Age':[40,20,30,20, np.nan, np.nan]})
df = data.groupby(["Sex","Pclass"])["Age"].median().to_frame().reset_index()
df.rename(columns={"Age":"Med"}, inplace=True)
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["Med"], data["Age"])

UPDATE:

# dummy dataframe
n = int(1e7)
data = pd.DataFrame({"Age":np.random.choice([10,20,20,30,30,40,np.nan], n),
                     "Pclass":np.random.choice([1,2,3], n),
                     "Sex":np.random.choice(["male","female"], n),
                     "a":np.random.choice(["a","b","c","d"], n)})

In my machine running this (is as the previous without renaming)

df = data.groupby(["Sex","Pclass"])["Age"].agg(['median']).reset_index()
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["median"], data["Age"])

CPU times: user 1.98 s, sys: 216 ms, total: 2.2 s
Wall time: 2.2 s

While the mask solution took:

for sex in ["male", "female"]:
    for pclass in range(1,4):
        mask1 =(data['Sex'] == sex)&(data['Pclass'] == pclass)
        med = data.loc[mask1, 'Age'].median()
        data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)

CPU times: user 5.13 s, sys: 60 ms, total: 5.19 s
Wall time: 5.19 s

@jezrael solution is even faster

data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))

CPU times: user 1.34 s, sys: 92 ms, total: 1.44 s
Wall time: 1.44 s
rpanai
  • 12,515
  • 2
  • 42
  • 64
  • Thanks for ur answer! I have an additional question. I used another way using for-loops. But I think ur way could be better because it uses numpy and pandas which is C++ and must be faster if will be used on large datasets, am I right? – George Vdovychenko Nov 24 '17 at 08:37