4

I have a pandas DataFrame with two columns: toy and color. The color column includes missing values.

How do I fill the missing color values with the most frequent color for that particular toy?

Here's the code to create a sample dataset:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'toy':['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color':['red', 'blue', 'blue', np.nan, 'green', np.nan,
             'red', 'red', np.nan, 'blue', 'red', np.nan, 'green']
    })

Here's the sample dataset:

      toy  color
0     car    red
1     car   blue
2     car   blue
3     car    NaN
4   train  green
5   train    NaN
6   train    red
7   train    red
8   train    NaN
9    ball   blue
10   ball    red
11   ball    NaN
12  truck  green

Here's the desired result:

  • Replace the first NaN with blue, since that is the most frequent color for a car.
  • Replace the second and third NaNs with red, since that is the most frequent color for a train.
  • Replace the fourth NaN with either blue or red, since they are tied for the most frequent color for a ball.

Notes about the real dataset:

  • There are many different toy types (not just four).
  • There are no toy types that only have missing values for color, so the answer does not need to handle that case.

This question is related, but it doesn't answer my question of how to use the most frequent value to fill in missing values.

Kevin Markham
  • 5,778
  • 1
  • 28
  • 36

2 Answers2

3

You can use groupby()+transform()+fillna():

df['color']=df['color'].fillna(df.groupby('toy')['color'].transform(lambda x:x.mode().iat[0]))

OR

If want to select random values when there are 2 or more frequent values:

from random import choice

df['color']=df['color'].fillna(df.groupby('toy')['color'].transform(lambda x:choice(x.mode())))
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
2

You want to fillna with the mode:

df["color"] = df.groupby("toy")["color"].apply(lambda x: x.fillna(x.mode().iat[0]))
not_speshal
  • 22,093
  • 2
  • 15
  • 30