Filling NaN using groupby dict

Question

I am attempting to replace the NaN in the 'cylinders' column using a dictionary with the median model cylinder number. I think it should work easily but I have been getting stuck each way I try.

cylinders_model_med = df.groupby('model').agg({'cylinders': 'median'})

cylinders_model_med=cylinders_model_med.to_dict()

#output:

  'bmw x5': 6.0,
  'buick enclave': 6.0,
  'cadillac escalade': 8.0,
  'chevrolet camaro': 6.0,
  'chevrolet camaro lt coupe 2d': 6.0,
  'chevrolet colorado': 5.0,
  'chevrolet corvette': 8.0,
  'chevrolet cruze': 4.0,
  'chevrolet equinox': 4.0,
  'chevrolet impala': 6.0,
  'chevrolet malibu': 4.0,
  'chevrolet silverado': 8.0,
  'chevrolet silverado 1500': 8.0,
  'chevrolet silverado 1500 crew': 8.0,
  'chevrolet silverado 2500hd': 8.0,
  'chevrolet silverado 3500hd': 8.0,
  'chevrolet suburban': 8.0,
  'chevrolet tahoe': 8.0,
  'chevrolet trailblazer': 6.0,
  'chevrolet traverse': 6.0,
  'chrysler 200': 4.0,
  'chrysler 300': 6.0,
  'chrysler town & country': 6.0,
  'dodge charger': 6.0,
  'dodge dakota': 6.0,
  'dodge grand caravan': 6.0,
  'ford econoline': 8.0,
  'ford edge': 6.0,
  'ford escape': 4.0,
  'ford expedition': 8.0,
  'ford explorer': 6.0,
  'ford f-150': 8.0,
  'ford f-250': 8.0,
  'ford f-250 sd': 8.0,
  'ford f-250 super duty': 8.0,
  'ford f-350 sd': 8.0,
  'ford f150': 8.0,
  'ford f150 supercrew cab xlt': 6.0,
  'ford f250': 8.0,
  'ford f250 super duty': 8.0,
  'ford f350': 8.0,
  'ford f350 super duty': 8.0,
  'ford focus': 4.0,
  'ford focus se': 4.0,
  'ford fusion': 4.0,
  'ford fusion se': 4.0,
  'ford mustang': 6.0,
  'ford mustang gt coupe 2d': 8.0,
  'ford ranger': 6.0,
  'ford taurus': 6.0,
  'gmc acadia': 6.0,
  'gmc sierra': 8.0,
  'gmc sierra 1500': 8.0,
  'gmc sierra 2500hd': 8.0,
  'gmc yukon': 8.0,
  'honda accord': 4.0,
  'honda civic': 4.0,
  'honda civic lx': 4.0,
  'honda cr-v': 4.0,
  'honda odyssey': 6.0,
  'honda pilot': 6.0,
  'hyundai elantra': 4.0,
  'hyundai santa fe': 6.0,
  'hyundai sonata': 4.0,
  'jeep cherokee': 6.0,
  'jeep grand cherokee': 6.0,
  'jeep grand cherokee laredo': 6.0,
  'jeep liberty': 6.0,
  'jeep wrangler': 6.0,
  'jeep wrangler unlimited': 6.0,
  'kia sorento': 4.0,
  'kia soul': 4.0,
  'mercedes-benz benze sprinter 2500': 6.0,
  'nissan altima': 4.0,
  'nissan frontier': 6.0,
  'nissan frontier crew cab sv': 6.0,
  'nissan maxima': 6.0,
  'nissan murano': 6.0,
  'nissan rogue': 4.0,
  'nissan sentra': 4.0,
  'nissan versa': 4.0,
  'ram 1500': 8.0,
  'ram 2500': 6.0,
  'ram 3500': 6.0,
  'subaru forester': 4.0,
  'subaru impreza': 4.0,
  'subaru outback': 4.0,
  'toyota 4runner': 6.0,
  'toyota camry': 4.0,
  'toyota camry le': 4.0,
  'toyota corolla': 4.0,
  'toyota highlander': 6.0,
  'toyota prius': 4.0,
  'toyota rav4': 4.0,
  'toyota sienna': 6.0,
  'toyota tacoma': 6.0,
  'toyota tundra': 8.0,
  'volkswagen jetta': 4.0,
  'volkswagen passat': 4.0}} 
#input:
df['cylinders']=df['cylinders'].fillna(cylinders_model_med)
df['cylinders'].isna().sum()
#output
5260

This is the same number of NaN I started with. I am new here so let me know if you need more (or less) information.

Thank you for your time!

Does this answer your question? [Pandas: fill in NaN values with dictionary references another column](https://stackoverflow.com/questions/42848911/pandas-fill-in-nan-values-with-dictionary-references-another-column) — Ignatius Reilly, Jun 26 '22 at 23:40
Not quite. It says that the float type is not iterable. I cant change the type because there are null values present. I am also not sure that this option is what I am looking for because I am not trying to use values from another column I am trying to use the medians from the cylinders column to fill the nan in the cylinders column. I will keep at it! Thank you for trying to help me. — hannah_pella, Jun 27 '22 at 03:17
In this example, the values from the other column aren't used to replace the NaNs, but as keys of a dictionary. The NaNs get replaced by the values from that dictionary (in your case, the medians). — Ignatius Reilly, Jun 27 '22 at 03:21
cylinders_model_med = df.groupby('model').agg({'cylinders': 'median'}) cy_dict=cylinders_model_med.to_dict() #cy_dict df['cylinders'] = df['cylinders'].fillna(df['model'].map(cy_dict)) I tried it like this and all of the NaN remain. any advice?? — hannah_pella, Jun 27 '22 at 13:44
Are you sure the dictionary covers all the possible cylinder's model? Or is it possible that you have all the values for certain models, but all the values for the others are missing? Can you take one of the rows with NaN and check if the name of the cylinder is in the keys of cylinders_model_med? — Ignatius Reilly, Jun 27 '22 at 15:43

score 0 · Answer 1 · answered Jun 26 '22 at 23:42

0

Alignment in pandas is based on index, so you need to create defaults which either you explicitly align to your dataframe or which will be aligned automatically. The easiest way to do this would be to create a default series with the same index as df using replace:

defaults = df['model'].replace(cylinders_model_med)
df['cylinders'] = df['cylinders'].fillna(defaults)

See the docs for more info: vectorized operations and label alignment

answered Jun 26 '22 at 23:42

Michael Delgado

13,789
3
29
54

I am attempting to create the defaults but I am getting this error. input:cylinders_model_med = df.groupby('model').agg({'cylinders': 'median'}) defaults=df['model'].replace(cylinders_model_med) output: AttributeError: 'Series' object has no attribute '_replace_columnwise' do you know what is causing this? – hannah_pella Jun 27 '22 at 02:44
in your question, `cylinders_model_med` is a dict - it looks like in this it's still a series – Michael Delgado Jun 27 '22 at 04:05
I made sure it was a dictionary by applying to_dict, as it states in my original question. In my notebook I changed the variable name to make sure and I got the same error as my comment here. Why would it be read as a series? do you have any advice on how to work around this? – hannah_pella Jun 27 '22 at 13:32
Make sure it actually looks like the values in your question and doesn’t have any nesting. It should just be a single level of key value pairs. – Michael Delgado Jun 27 '22 at 15:09

jch · Accepted Answer · 2022-06-27T04:46:10.383

0

Filling in NaN values from corresponding values is what combine_first does for a living. You could calculate the median cylinder number by model then fill in the original dataframe NaN cylinder numbers by model.

Assume this starting dataframe

            model  cylinders
0   nissan maxima        6.0
1   nissan maxima        6.0
2   nissan maxima        4.0
3   nissan murano        6.0
4   nissan murano        NaN
5   nissan murano        4.0
6   nissan murano        6.0
7    nissan rogue        4.0
8    nissan rogue        4.0
9    nissan rogue        NaN
10   nissan rogue        6.0
11  nissan sentra        6.0
12  nissan sentra        4.0
13  nissan sentra        4.0
14   nissan versa        4.0
15   nissan versa        4.0
16   nissan versa        NaN
17   nissan versa        4.0

Calculate the median cylinders by model and fill in the NaNs

df.assign(cylinders=df['cylinders'].combine_first(df[['model','cylinders']].groupby('model').transform('median').squeeze()))

Result

            model  cylinders
0   nissan maxima        6.0
1   nissan maxima        6.0
2   nissan maxima        4.0
3   nissan murano        6.0
4   nissan murano        6.0
5   nissan murano        4.0
6   nissan murano        6.0
7    nissan rogue        4.0
8    nissan rogue        4.0
9    nissan rogue        4.0
10   nissan rogue        6.0
11  nissan sentra        6.0
12  nissan sentra        4.0
13  nissan sentra        4.0
14   nissan versa        4.0
15   nissan versa        4.0
16   nissan versa        4.0
17   nissan versa        4.0

edited Jun 27 '22 at 04:46

answered Jun 27 '22 at 02:27

jch

3,600
1
15
17

hi! I tried your code and I got this error: NotImplementedError: cannot align with a higher dimensional NDFrame do you know where I am going wrong? – hannah_pella Jun 27 '22 at 02:51
Somehow the `.squeeze()` allowed a dataframe to sneak through. Can only align with a Series not a 2D dataframe. I updated my answer. See if that works for your data. – jch Jun 27 '22 at 04:48
Use `df.assign(cylinders=df['cylinders'].combine_first(df.groupby('model')['cylinders'].transform('median'))` – jezrael Jun 27 '22 at 04:51
No error this time. I tried both of the options provided here and although there is no error the, NaN are still there! all 5k of them. Both options returned a dataframe that looks similar to the original and when I sum using isna() I get the original number of NaN. How can this be? thanks for your continued help with this! – hannah_pella Jun 27 '22 at 13:27
@hannah_pella First thought is that the command above creates a new dataframe and then does the processing. I should be more clear in my answers in this respect. To assign that newly created dataframe back to the original dataframe you would need to do: `df = df.assign(cylinders= ...)`. If that doesn't solve the problem then it has to be related to the data. If that is the case then please provide an example of your original DF (or a reasonable facsimile). We'll get it to work! – jch Jun 27 '22 at 15:12
@jch That worked! I am going to try this with my other columns now. Thank you so much for your help. I am shocked at how difficult that was BUT now I have the solution for later down the road as well. Cheers! – hannah_pella Jun 27 '22 at 17:36

Filling NaN using groupby dict

2 Answers2