
I have a pandas DataFrame with two columns, city and country. Both city and country contain missing values. Consider this data frame:

import numpy as np
import pandas as pd
temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B","country B","country B","country B", "country C", "country C", "country C", "country C"],
                     "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})

I now want to fill in the NaNs in the city column with the mode of city within the same country, computed from the remaining rows. E.g. for country A: city 1 is mentioned once and city 2 is mentioned twice; thus, the missing city at index 2 should be filled with city 2, and so on.

I have done

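cities = [city for city in temp["country"].value_counts().index]  # distinct countries; unused below
# Mode(s) of city per country; a country with tied modes yields an array
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")
new_dict_locations = {}  # maps country -> city mode (or array of tied modes)
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]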

Now having the value of the country and the corresponding city mode, I face two issues:

First: country C is bimodal, so its key maps to two entries. I want this key to refer to each of the entries with equal probability. The real data set has countries with more than two modes, so the entry would be a list of len > 2.

Second: I'm stuck replacing the NaNs in city with the value that new_dict_locations holds for the country in the same row. In pseudo-code, this would be: `go through the column 'city'; if you find a missing value at position 'temp[i, city]', take the value of 'country' in that row (-> 'country_tmp'); use 'country_tmp' as key into the dictionary 'new_dict_locations'; if the dictionary entry at key 'country_tmp' is a list, randomly select one item from that list; take the return value (-> 'city_tmp') and fill the cell with the missing value ('temp[i, city]') with 'city_tmp'`.
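Read literally, the pseudo-code above describes a row-wise fill. A minimal sketch of that reading follows; the helper name fill_city, the seeded generator, and the trimmed-down frame and dictionary are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

def fill_city(row, lookup, rng):
    """Return the row's city, or a randomly drawn mode of its country if the city is missing."""
    if pd.notna(row["city"]):
        return row["city"]
    # A single mode (a plain string) becomes a one-element array; tied modes stay as they are.
    candidates = np.atleast_1d(lookup[row["country"]])
    return rng.choice(candidates)

# Trimmed-down version of the example frame (countries A and C only).
temp = pd.DataFrame({
    "country": ["country A"] * 4 + ["country C"] * 4,
    "city": ["city 1", "city 2", np.nan, "city 2", "city 5", np.nan, np.nan, "city 6"],
})
# Assumed shape of new_dict_locations: a string for one mode, an array for tied modes.
new_dict_locations = {"country A": "city 2", "country C": np.array(["city 5", "city 6"])}

rng = np.random.default_rng(0)
temp["city"] = temp.apply(fill_city, axis=1, lookup=new_dict_locations, rng=rng)
print(temp)
```

A row-wise apply is easy to follow but slow on large frames, so it is probably best treated as a reference for the intended behaviour rather than the final approach.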

I've tried using different combinations of .fillna() and .replace() (and read this and other questions) to no avail. Can someone give me a pointer?

Many thanks in advance.

(Note: the referenced question replaces values in one cell according to a dict; my reference values are, however, in a different column.)

** EDIT ** Executing temp["city"].fillna(temp['country'], inplace=True) and temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict'. [For the original data set, this error is TypeError: unhashable type: 'numpy.ndarray', but I cannot reproduce it with an example. If someone knows where the difference comes from, I'd be super happy to hear their thoughts.]

Ivo
  • What is the meaning of this line: "I want this key to refer to each of the entries with equal probability."? Can you give the expected output for the given case? – Parth Sep 16 '19 at 09:14
  • When I look up "country C" in the dictionary, I want it to randomly pick from ["city 5", "city 6"] – Ivo Sep 16 '19 at 09:16

2 Answers


Use map with the dict new_dict_locations to create a new Series s, then map again on s with np.random.choice to pick a value wherever the entry is an array. Finally, pass s to fillna:

s = (temp.country.map(new_dict_locations)
                 .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))

temp['city'] = temp.city.fillna(s)    

Out[247]:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 6
10  country C  city 5
11  country C  city 6

Note: I thought the two map calls could be merged into one by using a dict comprehension. However, doing so loses the randomness, because the random pick is then made once per country instead of once per row.
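The difference between the two approaches can be seen in a small sketch. The frame below is made up for illustration (1000 rows of country C), and the names modes, collapsed, per_key and per_row are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seeded only to make the sketch repeatable

country = pd.Series(["country C"] * 1000)
modes = {"country A": "city 2", "country C": np.array(["city 5", "city 6"])}

# One draw per key: the dict comprehension calls np.random.choice once per country.
collapsed = {k: np.random.choice(v) if isinstance(v, np.ndarray) else v
             for k, v in modes.items()}
per_key = country.map(collapsed)

# One draw per row: the lambda calls np.random.choice for every element.
per_row = country.map(modes).map(
    lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x
)

print(per_key.nunique())  # 1: every row gets the same city
print(per_row.nunique())  # 2: both cities appear
```

If one fixed pick per country is acceptable, the collapsed dictionary is enough; the second map is only needed when each missing row should get its own draw.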

Andy L.
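def get_mode(d):
    # Replace each array of tied modes with one uniformly drawn entry.
    # np.random.choice is uniform by default, so this works for any number of modes.
    for k, v in d.items():
        if isinstance(v, np.ndarray) and len(v) > 1:
            d[k] = np.random.choice(v)
    return d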
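A note on the probabilities passed to np.random.choice: it draws uniformly when p is omitted, and an explicit p has to sum to 1. A minimal sketch with three tied modes (the city names and variable names are made up for illustration):

```python
import numpy as np

modes = np.array(["city 5", "city 6", "city 7"])  # three tied modes (made-up data)

# Omitting p gives a uniform draw over all entries.
pick = np.random.choice(modes)

# An explicit p must sum to 1, so a fixed 0.5 per entry only works for exactly two modes.
try:
    np.random.choice(modes, p=[0.5] * len(modes))
    raised = False
except ValueError:
    raised = True

# A uniform p that scales with the number of modes.
pick_uniform = np.random.choice(modes, p=np.full(len(modes), 1 / len(modes)))

print(pick, raised, pick_uniform)
```

Since the real data set is said to contain more than two modes per country, relying on the default uniform draw seems the safer choice.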

The dictionary below is the one that will be used for filling.

new_dict_locations=get_mode(new_dict_locations)
keys=list(new_dict_locations.keys())
values=list(new_dict_locations.values())

# Filling happens here
temp.city=temp.city.fillna(temp.country).replace(keys, values)

This will give the desired output (the city chosen for country C is random and may differ between runs):

      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 5
10  country C  city 5
11  country C  city 6
Parth