0

I have a data frame with multiple columns but currently, I am focusing on province and city columns. The problem is I want to fill NaN values in the city column by the most frequent city value in the same province, but if the value count of any city is the same then it should fill with the one which is alphabetically first. I followed an answer from this link and modified it, but values are not being sorted alphabetically for some reason.

Here's my code:

df.groupby(['province'])['city'].agg(lambda x:x.value_counts().sort_values(ascending=False).index[0])

what do I need to change?

Sample DataFrame:

{'province': {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'B',
  8: 'B',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B'},
 'city': {0: 'newyork',
  1: nan,
  2: 'newyork',
  3: 'london',
  4: 'london',
  5: nan,
  6: 'houston',
  7: 'hyderabad',
  8: 'karachi',
  9: nan,
  10: 'hyderabad',
  11: nan,
  12: 'beijing',
  13: 'karachi'}}

Required Output:

{'province': {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'B',
  8: 'B',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B'},
 'city': {0: 'newyork',
  1: 'london',
  2: 'newyork',
  3: 'london',
  4: 'london',
  5: 'london',
  6: 'houston',
  7: 'hyderabad',
  8: 'karachi',
  9: 'hyderabad',
  10: 'hyderabad',
  11: 'hyderabad',
  12: 'beijing',
  13: 'karachi'}}
Vishwa Mittar
  • 378
  • 5
  • 16

1 Answers1

1

Try with groupby and mode:

mapper = df.groupby("province")["city"].agg(lambda x: x.mode().sort_values()[0]).to_dict()
df["city"] = df["city"].where(df["city"].notnull(),
                              df["province"].map(mapper))

>>> df
   province       city
0         A    newyork
1         A     london
2         A    newyork
3         A     london
4         A     london
5         A     london
6         A    houston
7         B  hyderabad
8         B    karachi
9         B  hyderabad
10        B  hyderabad
11        B  hyderabad
12        B    beijing
13        B    karachi
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • It didn't help, the data doesn't have only equal city value counts, your code is replacing NaN values with the city which is alphabetically first in the same province. I have edited the question and added data frame as well. Maybe it will make sense now. – Vishwa Mittar Mar 23 '22 at 15:54
  • It worked but, can you explain why are we using `notnull()` when we are actually filling null values? – Vishwa Mittar Mar 23 '22 at 16:16
  • 1
    Where the values are not null, keep the original value, for the rest, map with the dictionary. – not_speshal Mar 23 '22 at 16:19