Replace values in pandas column with default value for missing keys

Question

I have multiple simple functions that need to be implemented on every row of certain columns of my dataframe. The dataframe is very like, 10 million+ rows. My dataframe is something like this:

Date      location   city        number  value
12/3/2018   NY       New York      2      500
12/1/2018   MN       Minneapolis   3      600
12/2/2018   NY       Rochester     1      800
12/3/2018   WA       Seattle       2      400

I have functions like these:

def normalized_location(row):
    if row['city'] == " Minneapolis":
        return "FCM"
    elif row['city'] == "Seattle":
        return "FCS"
    else:
        return "Other"

and then I use:

df['Normalized Location'] =df.apply (lambda row: normalized_location (row),axis=1)

This is extremely slow, how can I make this more efficient?

score 9 · Accepted Answer · answered Dec 03 '18 at 21:31

9

We can make this BLAZING fast using map with a defaultdict.

from collections import defaultdict

d = defaultdict(lambda: 'Other')
d.update({"Minneapolis": "FCM", "Seattle": "FCS"})

df['normalized_location'] = df['city'].map(d)

print(df)
        Date location         city  number  value normalized_location
0  12/3/2018       NY     New York       2    500               Other
1  12/1/2018       MN  Minneapolis       3    600                 FCM
2  12/2/2018       NY    Rochester       1    800               Other
3  12/3/2018       WA      Seattle       2    400                 FCS

...to circumvent a fillna call, for performance reasons. This approach generalises to multiple replacements quite easily.

answered Dec 03 '18 at 21:31

cs95

379,657
97
704
746

I can Google (https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas), but can you briefly explain or link to an explanation for the `defaultdict(lambda: 'Other')` line? If the location is not in the `dict`, it gets set to `Other`? – Evan Dec 03 '18 at 21:44
1

@Evan You got it, when the key is not present, "Other" is returned by the dictionary. – cs95 Dec 03 '18 at 21:48
1

This is really fast. Thank you!! – Nazanin Zinouri Dec 03 '18 at 23:53
How would you do this if you needed to use regex instead of "Minneapolis" or "Seattle"? – jryan14ify Sep 19 '22 at 20:50

sacuL · Answer 2 · 2018-12-03T21:35:18.087

5

You might want to use np.select:

conds = [df.city == 'Minneapolis', df.city == 'Seattle']
choices = ['FCM', 'FCS']

df['normalized_location'] = np.select(conds, choices, default='other')

>>> df
        Date location         city  number  value normalized_location
0  12/3/2018       NY     New York       2    500               other
1  12/1/2018       MN  Minneapolis       3    600                 FCM
2  12/2/2018       NY    Rochester       1    800               other
3  12/3/2018       WA      Seattle       2    400                 FCS

edited Dec 03 '18 at 21:35

answered Dec 03 '18 at 21:27

sacuL

49,704
8
81
106

1

Learn something new every day +1 – rahlf23 Dec 03 '18 at 21:29
2

Just a small note, for each replacement, you will need to compute a separate mask. – cs95 Dec 03 '18 at 21:32

yatu · Answer 3 · 2018-12-03T21:36:21.693

3

Try this instead:

map_ = {'Minneapolis':'FCM', 'Seattle':'FCS'}
df.loc[:,'city'] = df.loc[:,'city'].map(map_).fillna('Other')

print(df)
    Date      location  city    number  value
0  12/3/2018       NY  Other       2    500
1  12/1/2018       MN    FCM       3    600
2  12/2/2018       NY  Other       1    800
3  12/3/2018       WA    FCS       2    400

edited Dec 03 '18 at 21:36

answered Dec 03 '18 at 21:29

yatu

86,083
12
84
139

score 2 · Answer 4 · answered Dec 03 '18 at 21:27

2

You can use a nested np.where():

df['city'] = np.where(df['city']=='Minneapolis', 'FCM', np.where(df['city']=='Seattle', 'FCS', 'Other'))

answered Dec 03 '18 at 21:27

rahlf23

8,869
4
24
54

This turns out to be faster than `np.select` for nesting depths upto 3. – cs95 Dec 03 '18 at 21:38

Replace values in pandas column with default value for missing keys

4 Answers4