
I am using an FCC API to convert lat/long coordinates into block group codes:

import pandas as pd
import urllib.request
import json

# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long into a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

new_list = []

def block(x):
    for index,row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load the json output into a form Python can understand
        a1 = json.loads(a)
        #append the FIPS code to the shared results list
        new_list.append(a1['Block']['FIPS'])

#call the function with latlong as the argument.        
block(latlong)

#print the list; note that the function appends its results to the global list
print(new_list)

This gives the following output:

['360610031001021', '060372074001033', '170318391001104', '482011000003087', 
 '421010005001010', '040131141001032', '480291101002041', '060730053003011', 
 '481130204003064', '060855010004004', '484530011001092', '180973910003057', 
 '120310010001023', '060750201001001', '390490040001005', '371190001005000', 
 '484391233002071', '261635172001069', '481410029001001', '471570042001018']

The problem with this script is that it can only make one API call at a time, one row after another. It takes about 5 minutes per thousand rows, which is not acceptable for the 1,000,000+ entries I plan to run it on; at that rate the full run would take roughly 3.5 days.

I want to use multiprocessing to parallelize this function and decrease the running time. I have tried to look into the multiprocessing documentation, but have not been able to figure out how to run the function and collect the output in a list in parallel.
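
What I am after is something shaped like this, reusing getup, getup1, getup2, lat, and long from above (a rough sketch with a made-up get_fips helper; I could not get a version of this working):

from multiprocessing import Pool

def get_fips(coords):
    #hypothetical per-row worker: call the API for one (lat, long)
    #pair and return the FIPS code instead of appending to a shared list
    la, lo = coords
    a = urllib.request.urlopen(getup + la + getup1 + lo + getup2).read()
    return json.loads(a)['Block']['FIPS']

if __name__ == '__main__':
    with Pool(10) as pool:
        #map returns the results in input order, so no shared list is needed
        new_list = pool.map(get_fips, zip(lat, long))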

Just for reference: I am using Python 3.6.

Any guidance would be great!

  • Hey, you may want to look at the [python GIL](https://wiki.python.org/moin/GlobalInterpreterLock). Using parallelism in Python most of the time raises the computing time instead of decreasing it. – Tbaki Oct 04 '17 at 16:52
  • Since you're IO-bound, threads make sense here; you'll have to restructure your problem to avoid appending to a global list. The docs are a good place to start (see the sketch after these comments): https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example – chrisb Oct 04 '17 at 17:27
  • @Tbaki `multiprocessing` is not affected by the GIL, indeed, it was created to provide a `threading` -like api to create multiple processes to *by-pass* the limitations of the GIL. As @chrisb points out, though, since this code is IO bound, `threading` won't be limited by the GIL either. – juanpa.arrivillaga Oct 04 '17 at 17:36
  • @juanpa.arrivillaga Thanks for the information ! : D – Tbaki Oct 05 '17 at 07:43
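
A minimal sketch of what chrisb's comment suggests, using only the standard library (fetch_fips is a hypothetical helper; getup, getup1, getup2, lat, and long come from the question's code):

from concurrent.futures import ThreadPoolExecutor
import urllib.request
import json

def fetch_fips(pair):
    #hypothetical helper: fetch one (lat, long) pair and return its FIPS code
    la, lo = pair
    raw = urllib.request.urlopen(getup + la + getup1 + lo + getup2).read()
    return json.loads(raw)['Block']['FIPS']

#executor.map preserves input order, so no appending to a global list
with ThreadPoolExecutor(max_workers=10) as executor:
    new_list = list(executor.map(fetch_fips, zip(lat, long)))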

1 Answer


You do not have to implement the parallelism yourself; there are libraries better suited than urllib, e.g. requests [0] and some of its spin-offs [1], which use either threads or futures. You will need to test for yourself which one is fastest.
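
With the grequests spin-off [1], for example, the whole fetch collapses to a few lines (a sketch; grequests is gevent-based, and failed requests come back as None, which you would need to handle):

import grequests  #gevent-based; import before other network libraries

urls = [getup + la + getup1 + lo + getup2 for la, lo in zip(lat, long)]
#grequests.map fires all requests concurrently and returns responses in order
responses = grequests.map(grequests.get(u) for u in urls)
new_list = [r.json()['Block']['FIPS'] for r in responses]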

Because of its small number of dependencies, I like requests-futures best. Here is my implementation of your code using ten threads; the library even supports processes, if you believe or find that they work better in your case (see the note after the code):

import pandas as pd
import json
from concurrent.futures import ThreadPoolExecutor

from requests_futures.sessions import FuturesSession

#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long into a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

def block(x):
    futures = []
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    for index, row in x.iterrows():
        #build the request url and fire off the request without blocking
        url = getup + row['lat'] + getup1 + row['long'] + getup2
        futures.append(session.get(url))
    new_list = []
    for future in futures:
        #wait for the response to arrive, then parse the json body
        a1 = json.loads(future.result().content)
        #append the FIPS code to the results list
        new_list.append(a1['Block']['FIPS'])
    return new_list

#call the function with latlong as the argument.        
new_list = block(latlong)

#print the returned list
print(new_list)
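
As mentioned above, the same session can be backed by processes instead of threads. A sketch, assuming your requests-futures version accepts a ProcessPoolExecutor (the objects crossing process boundaries must then be picklable):

from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession

#swap the thread pool for a process pool; the rest of block() is unchanged
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=10))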

[0] http://docs.python-requests.org/en/master/

[1] https://github.com/kennethreitz/grequests

mkastner