How can I increase DataFrame's apply function efficiency?

Question

I know it is kind of impossible, but I am still hoping for an alternative. Here is my DataFrame (read from a CSV file):

I need to identify the country and continent of each location. Here is my idea:

! pip install geopy pycountry_convert

import json  # needed for json.loads below

import pandas as pd
from geopy.geocoders import Nominatim
import pycountry_convert as pc

def geo(text):
  if text:
    try:
      country = geolocator.geocode(text, language='en')[0].split(', ')[-1]
      country_code = pc.country_name_to_country_alpha2(country, cn_name_format="default")
      continent_code = pc.country_alpha2_to_continent_code(country_code)
      return country_code, country, continent_code, continents[continent_code]
    except Exception:
      return None, None, None, None
  else:
    return None, None, None, None

continents = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Oceania',
    'AF': 'Africa',
    'EU': 'Europe'
}

geolocator = Nominatim(user_agent="geo-lookup")  # geopy 2.x requires a custom user_agent; this name is a placeholder
country_dict = json.loads("{\"AD\":\"Europe\",\"AE\":\"Asia\",\"AF\":\"Asia\",\"AG\":\"North America\",\"AI\":\"North America\",\"AL\":\"Europe\",\"AM\":\"Asia\",\"AN\":\"North America\",\"AO\":\"Africa\",\"AQ\":\"Antarctica\",\"AR\":\"South America\",\"AS\":\"Australia\",\"AT\":\"Europe\",\"AU\":\"Australia\",\"AW\":\"North America\",\"AZ\":\"Asia\",\"BA\":\"Europe\",\"BB\":\"North America\",\"BD\":\"Asia\",\"BE\":\"Europe\",\"BF\":\"Africa\",\"BG\":\"Europe\",\"BH\":\"Asia\",\"BI\":\"Africa\",\"BJ\":\"Africa\",\"BM\":\"North America\",\"BN\":\"Asia\",\"BO\":\"South America\",\"BR\":\"South America\",\"BS\":\"North America\",\"BT\":\"Asia\",\"BW\":\"Africa\",\"BY\":\"Europe\",\"BZ\":\"North America\",\"CA\":\"North America\",\"CC\":\"Asia\",\"CD\":\"Africa\",\"CF\":\"Africa\",\"CG\":\"Africa\",\"CH\":\"Europe\",\"CI\":\"Africa\",\"CK\":\"Australia\",\"CL\":\"South America\",\"CM\":\"Africa\",\"CN\":\"Asia\",\"CO\":\"South America\",\"CR\":\"North America\",\"CU\":\"North America\",\"CV\":\"Africa\",\"CX\":\"Asia\",\"CY\":\"Asia\",\"CZ\":\"Europe\",\"DE\":\"Europe\",\"DJ\":\"Africa\",\"DK\":\"Europe\",\"DM\":\"North America\",\"DO\":\"North America\",\"DZ\":\"Africa\",\"EC\":\"South America\",\"EE\":\"Europe\",\"EG\":\"Africa\",\"EH\":\"Africa\",\"ER\":\"Africa\",\"ES\":\"Europe\",\"ET\":\"Africa\",\"FI\":\"Europe\",\"FJ\":\"Australia\",\"FK\":\"South America\",\"FM\":\"Australia\",\"FO\":\"Europe\",\"FR\":\"Europe\",\"GA\":\"Africa\",\"GB\":\"Europe\",\"GD\":\"North America\",\"GE\":\"Asia\",\"GF\":\"South America\",\"GG\":\"Europe\",\"GH\":\"Africa\",\"GI\":\"Europe\",\"GL\":\"North America\",\"GM\":\"Africa\",\"GN\":\"Africa\",\"GP\":\"North America\",\"GQ\":\"Africa\",\"GR\":\"Europe\",\"GS\":\"Antarctica\",\"GT\":\"North America\",\"GU\":\"Australia\",\"GW\":\"Africa\",\"GY\":\"South America\",\"HK\":\"Asia\",\"HN\":\"North America\",\"HR\":\"Europe\",\"HT\":\"North 
America\",\"HU\":\"Europe\",\"ID\":\"Asia\",\"IE\":\"Europe\",\"IL\":\"Asia\",\"IM\":\"Europe\",\"IN\":\"Asia\",\"IO\":\"Asia\",\"IQ\":\"Asia\",\"IR\":\"Asia\",\"IS\":\"Europe\",\"IT\":\"Europe\",\"JE\":\"Europe\",\"JM\":\"North America\",\"JO\":\"Asia\",\"JP\":\"Asia\",\"KE\":\"Africa\",\"KG\":\"Asia\",\"KH\":\"Asia\",\"KI\":\"Australia\",\"KM\":\"Africa\",\"KN\":\"North America\",\"KP\":\"Asia\",\"KR\":\"Asia\",\"KW\":\"Asia\",\"KY\":\"North America\",\"KZ\":\"Asia\",\"LA\":\"Asia\",\"LB\":\"Asia\",\"LC\":\"North America\",\"LI\":\"Europe\",\"LK\":\"Asia\",\"LR\":\"Africa\",\"LS\":\"Africa\",\"LT\":\"Europe\",\"LU\":\"Europe\",\"LV\":\"Europe\",\"LY\":\"Africa\",\"MA\":\"Africa\",\"MC\":\"Europe\",\"MD\":\"Europe\",\"ME\":\"Europe\",\"MG\":\"Africa\",\"MH\":\"Australia\",\"MK\":\"Europe\",\"ML\":\"Africa\",\"MM\":\"Asia\",\"MN\":\"Asia\",\"MO\":\"Asia\",\"MP\":\"Australia\",\"MQ\":\"North America\",\"MR\":\"Africa\",\"MS\":\"North America\",\"MT\":\"Europe\",\"MU\":\"Africa\",\"MV\":\"Asia\",\"MW\":\"Africa\",\"MX\":\"North America\",\"MY\":\"Asia\",\"MZ\":\"Africa\",\"NA\":\"Africa\",\"NC\":\"Australia\",\"NE\":\"Africa\",\"NF\":\"Australia\",\"NG\":\"Africa\",\"NI\":\"North America\",\"NL\":\"Europe\",\"NO\":\"Europe\",\"NP\":\"Asia\",\"NR\":\"Australia\",\"NU\":\"Australia\",\"NZ\":\"Australia\",\"OM\":\"Asia\",\"PA\":\"North America\",\"PE\":\"South America\",\"PF\":\"Australia\",\"PG\":\"Australia\",\"PH\":\"Asia\",\"PK\":\"Asia\",\"PL\":\"Europe\",\"PM\":\"North America\",\"PN\":\"Australia\",\"PR\":\"North America\",\"PS\":\"Asia\",\"PT\":\"Europe\",\"PW\":\"Australia\",\"PY\":\"South 
America\",\"QA\":\"Asia\",\"RE\":\"Africa\",\"RO\":\"Europe\",\"RS\":\"Europe\",\"RU\":\"Europe\",\"RW\":\"Africa\",\"SA\":\"Asia\",\"SB\":\"Australia\",\"SC\":\"Africa\",\"SD\":\"Africa\",\"SE\":\"Europe\",\"SG\":\"Asia\",\"SH\":\"Africa\",\"SI\":\"Europe\",\"SJ\":\"Europe\",\"SK\":\"Europe\",\"SL\":\"Africa\",\"SM\":\"Europe\",\"SN\":\"Africa\",\"SO\":\"Africa\",\"SR\":\"South America\",\"ST\":\"Africa\",\"SV\":\"North America\",\"SY\":\"Asia\",\"SZ\":\"Africa\",\"TC\":\"North America\",\"TD\":\"Africa\",\"TF\":\"Antarctica\",\"TG\":\"Africa\",\"TH\":\"Asia\",\"TJ\":\"Asia\",\"TK\":\"Australia\",\"TM\":\"Asia\",\"TN\":\"Africa\",\"TO\":\"Australia\",\"TR\":\"Asia\",\"TT\":\"North America\",\"TV\":\"Australia\",\"TW\":\"Asia\",\"TZ\":\"Africa\",\"UA\":\"Europe\",\"UG\":\"Africa\",\"US\":\"North America\",\"UY\":\"South America\",\"UZ\":\"Asia\",\"VC\":\"North America\",\"VE\":\"South America\",\"VG\":\"North America\",\"VI\":\"North America\",\"VN\":\"Asia\",\"VU\":\"Australia\",\"WF\":\"Australia\",\"WS\":\"Australia\",\"YE\":\"Asia\",\"YT\":\"Africa\",\"ZA\":\"Africa\",\"ZM\":\"Africa\",\"ZW\":\"Africa\"}")
test['GEO'] = test['LOCATION'].apply(geo)

It looks like this would run forever because my dataframe has more than 500 million records. Can I still improve it?

Well I don't know if it will be more performant, but `geopandas` is a thing — anon01, Nov 25 '20 at 04:06
500m is a lot. First thought is to try `dask`; consider throwing more compute at this as it's trivially parallelizable. — anon01, Nov 25 '20 at 04:09
can you add the full code? `geolocator.geocode` is not defined — anon01, Nov 25 '20 at 05:50
if it's a dataframe, consider adding result of `test = df.sample(n=20).to_dict()` for example — anon01, Nov 26 '20 at 03:26

Kenan · Answer 1 · 2020-11-25T04:09:00.243

2

Since you're using another package (i.e., geopy), I don't think much can be done about the logic; however, you can improve execution with swifter.

Swifter will try to use all CPU cores for the process.

import swifter
test['GEO'] = test['LOCATION'].swifter.apply(geo)

Here is an article about swifter

edited Nov 25 '20 at 04:09

answered Nov 25 '20 at 04:06

yes I use it all the time when my df.shape[0] > 50k – Kenan Nov 25 '20 at 04:08
LOL I tried it just now, it seems that it is not applicable to my situation because what I need to handle is textual data while swifter only works well with vectorized functions. – Memphis Meng Nov 25 '20 at 04:32
swifter will at least speed up the process with multiprocessing – Kenan Nov 25 '20 at 04:58

score 1 · Accepted Answer · answered Nov 26 '20 at 04:21

First, some soapboxing about slow code: you really should profile your code to understand why/where it's slow before trying to make it faster. In your case, geolocator.geocode makes a network call (probably rate limited) which takes ~1 second; for a 500m row dataframe, that will take ~15 years to complete. Dask/swifter/spark are not the right solution to this problem either, and will waste compute needlessly without making it much faster (maybe only 3 years :).

To solve the problem at hand: the geolocator.geocode call is only used to get the country name, and it's easy to get all country names locally and match against them instead. A minimal adaptation that will improve performance by orders of magnitude would look like this:
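As a rough illustration of the estimate above, the following sketch times a function on a small sample and extrapolates linearly to the full row count. `slow_geocode` is a hypothetical stand-in that only simulates network latency with a short sleep; it is not the real geopy call, and the 10 ms delay is an arbitrary assumption.

```python
import time

SECONDS_PER_YEAR = 3600 * 24 * 365


def slow_geocode(text):
    # Hypothetical stand-in for geolocator.geocode: simulates network latency.
    time.sleep(0.01)
    return text.split(", ")[-1]


def estimate_total_seconds(func, sample, total_rows):
    """Time func on a small sample and extrapolate linearly to total_rows."""
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * total_rows


sample = ["Paris, France", "Lagos, Nigeria", "Lima, Peru"]
seconds = estimate_total_seconds(slow_geocode, sample, 500_000_000)
print(f"extrapolated from sample: {seconds / SECONDS_PER_YEAR:.2f} years")

# The figure quoted in the answer: ~1 second per call over 500 million rows.
years_at_one_second = 500_000_000 * 1.0 / SECONDS_PER_YEAR
print(f"at 1 s per call: {years_at_one_second:.1f} years")
```

Running the same measurement on a few hundred real rows would show whether the per-row cost is dominated by the network call before any parallelisation is attempted.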

# find a package that contains country names
!pip install country_list
from country_list import countries_for_language
import pycountry_convert as pc  # as in the question; `continents` is reused from there too

# countries_for_language returns (alpha-2 code, name) pairs
country_dict = dict(countries_for_language('en'))
COUNTRIES = set(country_dict.values())


def geo(text):
  if text:
    try:
      # assume the country is the last comma-separated component
      country_name = text.split(", ")[-1]
      if country_name in COUNTRIES:
        country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
        continent_code = pc.country_alpha2_to_continent_code(country_code)
        return country_code, country_name, continent_code, continents[continent_code]
    except Exception:
      pass
  # empty input, unknown country name, or a failed lookup
  return None, None, None, None

This should run a million times faster (maybe literally). You could improve on this with better parsing of your location values, but this will be a huge lift.
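One possible direction for the "better parsing" mentioned above is to normalise case and whitespace and to try every comma-separated part rather than only the last one. This is only a sketch: `extract_country` and `_BY_KEY` are hypothetical names, and the three-entry `COUNTRIES` set is an illustrative subset standing in for the full set built from country_list.

```python
import re

# Illustrative subset; the answer builds the full set from country_list.
COUNTRIES = {"France", "Hong Kong", "United States"}

# Case-insensitive lookup key -> canonical country name.
_BY_KEY = {name.casefold(): name for name in COUNTRIES}


def extract_country(text):
    """Return the canonical country name found in text, or None."""
    if not isinstance(text, str):
        return None
    # Try comma-separated parts from last to first, ignoring case and padding.
    for part in reversed(re.split(r"\s*,\s*", text.strip())):
        match = _BY_KEY.get(part.casefold())
        if match:
            return match
    return None


print(extract_country("Paris,France"))
print(extract_country(" kowloon , HONG KONG "))
print(extract_country("香港daibo"))
```

Strings without a recognisable English country name, such as the last example, still return None, so free-form or mixed-language values would need a separate strategy.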

It might be a good shot, but many records in my data are unintelligible. For example, "香港daibo" actually and literally means "Daibo, Hong Kong". But when using geopy, this is not resolved since there is a grammatical error. I'm afraid the same thing would happen here. Anyway, I will let you know once I have tried it. — Memphis Meng, Nov 26 '20 at 16:40
You can still improve on the above massively by 1) vectorizing the above approach and 2) inferring values when there isn't a perfect match. Tag me if you ask another related question and I'll take a look — anon01, Nov 26 '20 at 17:04
That makes sense, I don't have so much experience in vectorizing textual data. Maybe it needs some of your advice. — Memphis Meng, Nov 26 '20 at 17:06
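The vectorizing suggested in the comments could look roughly like the sketch below, which splits the column once with pandas string methods and maps the result through a lookup table instead of calling a Python function per row. `COUNTRY_INFO` and `add_geo_columns` are hypothetical names, and the three-row table is an illustrative stand-in; in practice it would presumably be built once from country_list and pycountry_convert.

```python
import pandas as pd

# Illustrative stand-in: country name -> (alpha-2 code, continent code, continent).
COUNTRY_INFO = {
    "France": ("FR", "EU", "Europe"),
    "Nigeria": ("NG", "AF", "Africa"),
    "Peru": ("PE", "SA", "South America"),
}


def add_geo_columns(df, column="LOCATION"):
    """Vectorized variant: split once, then map through a lookup table."""
    # Last comma-separated component of each location string (missing stays missing).
    country = df[column].str.rsplit(", ", n=1).str[-1]
    lookup = pd.DataFrame.from_dict(
        COUNTRY_INFO,
        orient="index",
        columns=["country_code", "continent_code", "continent"],
    )
    out = df.copy()
    # Keep the name only where it is a known country.
    out["country"] = country.where(country.isin(lookup.index))
    for col in lookup.columns:
        # Unknown names map to missing values.
        out[col] = country.map(lookup[col])
    return out


test = pd.DataFrame({"LOCATION": ["Paris, France", "Lagos, Nigeria", "nowhere", None]})
result = add_geo_columns(test)
print(result)
```

This produces separate columns rather than one tuple per row, which tends to be easier to filter and group on afterwards.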
