I have a pandas dataframe with two columns, city
and country
. Both city
and country
contain missing values. consider this data frame:
temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B","country B","country B","country B", "country C", "country C", "country C", "country C"],
"city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})
I now want to fill in the NaN
s in the city
column with the mode of the country's city in the remaining data frame, e.g. for country A: city 1 is mentioned once; city 2 is mentioned twice; thus, fill the column city
at index 2
with city 2
etc.
I have done
cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")
for k in dict_locations.keys():
new_dict_locations[k] = dict_locations[k]["city"]
Now having the value of the country and the corresponding city mode, I face two issues:
First: the case country C
is bimodal - the key contains two entries. I want this key to refer to each of the entries with equal probability. The real data set has multiple modes, so it would be a list of len > 2.
Second: I'm stuck replacing the NaN
s in city
with the value corresponding to the value in the same line's country
cell in new_dict_locations
. In pseudo-code, this would be: `go through the column 'city'; if you find a missing value at position 'temp[i, city]', take the value of 'country' in that row (-> 'country_tmp'); take 'country_tmp' as key to the dictionary 'new_dict_locations'; if the dictionary at key 'country_temp' is a list, randomly select one item from that list; take the return value (-> 'city_tmp') and fill the cell with the missing value (temp[i, city]) with the value 'city_temp').
I've tried using different combinations of .fillna()
and .replace()
(and read this and other questions to no avail.* Can someone give me a pointer?
Many thanks in advance.
(Note: the referenced question replaces values in one cell according to a dict; my reference values are, however, in a different column.)
** EDIT **
executing temp["city"].fillna(temp['country'], inplace=True)
and temp.replace({'city': dict_locations})
gives me an error: TypeError: unhashable type: 'dict'
[This error is TypeError: unhashable type: 'numpy.ndarray'
for the original data set but I cannot reproduce it with an example - if someone knows the whereabouts of the difference, I'd be super happy to hear their thoughts.]