1

I have a sample dataframe which generically looks like this:

df = pd.Dataframe({'Class': [1, 2, 3, 2, 1, 2, 3, 2],
                   'Sex': [1, 0, 0, 0, 1, 1, 0, 1],
                   'Age': [15, 24, 13, 28, 29, NaN, 34, 27]})

Which displays as:

    Age  Class  Sex
0  15.0      1    1
1  24.0      2    0
2  13.0      2    0
3  28.0      2    0
4  29.0      1    1
5   NaN      2    1
6  34.0      1    0
7  27.0      2    1

What I'd like to do is fill in each of the NaN values in the 'Age' series with the median value for all entries that have their 'Class' and 'Sex' grouping.

So for example, when I access these values like so:

df.groupby(['Class', 'Sex'])['Age'].median()

and get:

  Class  Sex
   1      0      34.0
          1      22.0
   2      0      24.0
          1      27.0

I'd like to write a function that automatically fills the extant NaN value with 27 since that is the median of the entries that have a Class value of 2 and a Sex value of 1.

Right now I have:

df['Age'] = df.groupby(['Class', 'Sex'])['Age'].apply(lambda x: x.median() if pd.isnull(x) else x)

and am getting the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

However, a very similar syntax was used in the answers for this question and this one, so I'm not quite sure why mine doesn't work, particularly, the latter also uses the isnull method in its lambda function so it's not clear to me why mine doesn't work but that one does.

I've also tried using the fillna method like so:

df['Age'] = df['Age'].fillna(df.groupby(['Class', 'Sex'])['Age'].median())

But got the following error message:

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

I'm open to other methods which get the same value, but prefer something that relies entirely on Pandas methods without having to use a separate for-loop and passing it into the 'Apply' method to make it as concise as possible.

Thank you.

Community
  • 1
  • 1
Jonathan Bechtel
  • 3,497
  • 4
  • 43
  • 73

1 Answers1

2

One option would be to use transform to replace null values with median for the Age column:

df['Age'] = df.groupby(['Class', 'Sex']).Age.transform(lambda col: col.where(col.notnull(), col.median()))

df

#   Age Class   Sex
#0  15.0    1   1
#1  24.0    2   0
#2  13.0    3   0
#3  28.0    2   0
#4  29.0    1   1
#5  27.0    2   1
#6  34.0    3   0
#7  27.0    2   1

Alternatively use replace method instead of where also works:

df['Age'] = df.groupby(['Class', 'Sex']).Age.transform(lambda col: col.replace(np.nan, col.median()))
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • Thank you. Quick question: is col.notnull() supposed to be col.isnull() since I'm looking to replace the null values? – Jonathan Bechtel Sep 07 '16 at 19:05
  • 1
    `pandas.Series.where()` keeps the elements where the condition is true and replace values with the second argument where the condition is false, it's different from `replace()` method. You can read this for more information http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html – Psidom Sep 07 '16 at 19:08
  • Both of these methods work, but I'll add that they also work when using the apply method, albeit with a small performance penalty. So it was replacing the if/else statement with the where/replace methods that did the trick. Times for the where() method were: 19.5ms(apply), 15.2ms(transform). Times for the replace() method were: 13.4ms(apply), 8.53(transform). – Jonathan Bechtel Sep 07 '16 at 19:12
  • `apply` should work the same as `transform` here, the if/else is not vectorized in python so it doesn't accept array of conditions. Try `3 if np.array([1,2,2,3,3]) == 2 else 0` which gives exactly the same error message as you have seen. – Psidom Sep 07 '16 at 19:18
  • 3
    Will be even faster with `col.fillna(col.median())` – Zeugma Sep 07 '16 at 19:29