6

I have multiple simple functions that need to be implemented on every row of certain columns of my dataframe. The dataframe is very like, 10 million+ rows. My dataframe is something like this:

Date      location   city        number  value
12/3/2018   NY       New York      2      500
12/1/2018   MN       Minneapolis   3      600
12/2/2018   NY       Rochester     1      800
12/3/2018   WA       Seattle       2      400

I have functions like these:

def normalized_location(row):
    if row['city'] == " Minneapolis":
        return "FCM"
    elif row['city'] == "Seattle":
        return "FCS"
    else:
        return "Other"

and then I use:

df['Normalized Location'] =df.apply (lambda row: normalized_location (row),axis=1)

This is extremely slow, how can I make this more efficient?

cs95
  • 379,657
  • 97
  • 704
  • 746
Nazanin Zinouri
  • 229
  • 3
  • 9

4 Answers4

9

We can make this BLAZING fast using map with a defaultdict.

from collections import defaultdict

d = defaultdict(lambda: 'Other')
d.update({"Minneapolis": "FCM", "Seattle": "FCS"})

df['normalized_location'] = df['city'].map(d)

print(df)
        Date location         city  number  value normalized_location
0  12/3/2018       NY     New York       2    500               Other
1  12/1/2018       MN  Minneapolis       3    600                 FCM
2  12/2/2018       NY    Rochester       1    800               Other
3  12/3/2018       WA      Seattle       2    400                 FCS

...to circumvent a fillna call, for performance reasons. This approach generalises to multiple replacements quite easily.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • I can Google (https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas), but can you briefly explain or link to an explanation for the `defaultdict(lambda: 'Other')` line? If the location is not in the `dict`, it gets set to `Other`? – Evan Dec 03 '18 at 21:44
  • 1
    @Evan You got it, when the key is not present, "Other" is returned by the dictionary. – cs95 Dec 03 '18 at 21:48
  • 1
    This is really fast. Thank you!! – Nazanin Zinouri Dec 03 '18 at 23:53
  • How would you do this if you needed to use regex instead of "Minneapolis" or "Seattle"? – jryan14ify Sep 19 '22 at 20:50
5

You might want to use np.select:

conds = [df.city == 'Minneapolis', df.city == 'Seattle']
choices = ['FCM', 'FCS']

df['normalized_location'] = np.select(conds, choices, default='other')

>>> df
        Date location         city  number  value normalized_location
0  12/3/2018       NY     New York       2    500               other
1  12/1/2018       MN  Minneapolis       3    600                 FCM
2  12/2/2018       NY    Rochester       1    800               other
3  12/3/2018       WA      Seattle       2    400                 FCS
sacuL
  • 49,704
  • 8
  • 81
  • 106
3

Try this instead:

map_ = {'Minneapolis':'FCM', 'Seattle':'FCS'}
df.loc[:,'city'] = df.loc[:,'city'].map(map_).fillna('Other')

print(df)
    Date      location  city    number  value
0  12/3/2018       NY  Other       2    500
1  12/1/2018       MN    FCM       3    600
2  12/2/2018       NY  Other       1    800
3  12/3/2018       WA    FCS       2    400
yatu
  • 86,083
  • 12
  • 84
  • 139
2

You can use a nested np.where():

df['city'] = np.where(df['city']=='Minneapolis', 'FCM', np.where(df['city']=='Seattle', 'FCS', 'Other'))
rahlf23
  • 8,869
  • 4
  • 24
  • 54