I have a pandas DataFrame with two columns: toy
and color
. The color
column includes missing values.
How do I fill the missing color
values with the most frequent color
for that particular toy
?
Here's the code to create a sample dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'toy':['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
'color':['red', 'blue', 'blue', np.nan, 'green', np.nan,
'red', 'red', np.nan, 'blue', 'red', np.nan, 'green']
})
Here's the sample dataset:
toy color
0 car red
1 car blue
2 car blue
3 car NaN
4 train green
5 train NaN
6 train red
7 train red
8 train NaN
9 ball blue
10 ball red
11 ball NaN
12 truck green
Here's the desired result:
- Replace the first NaN with blue, since that is the most frequent
color
for a car. - Replace the second and third NaNs with red, since that is the most frequent
color
for a train. - Replace the fourth NaN with either blue or red, since they are tied for the most frequent
color
for a ball.
Notes about the real dataset:
- There are many different
toy
types (not just four). - There are no
toy
types that only have missing values forcolor
, so the answer does not need to handle that case.
This question is related, but it doesn't answer my question of how to use the most frequent value to fill in missing values.