
I have a process that loops over a list of IP addresses and returns some information about each one. The simple for loop works great; my issue is running this at scale, given Python's Global Interpreter Lock (GIL).

My goal is to run this function in parallel and make full use of my 4 cores, so that running 100K of these doesn't take 24 hours the way a normal for loop would.

After reading other answers on here, particularly this one, How do I parallelize a simple Python loop?, I decided to use joblib. When I ran 10 records through it (the example below), it took over 10 minutes. That doesn't sound right, so I know there is something I'm doing wrong or not understanding. Any help is greatly appreciated!

import pandas as pd
import numpy as np
import os
from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing

num_core = multiprocessing.cpu_count()

iplookup = ['174.192.22.197',\
            '70.197.71.201',\
            '174.195.146.248',\
            '70.197.15.130',\
            '174.208.14.133',\
            '174.238.132.139',\
            '174.204.16.10',\
            '104.132.11.82',\
            '24.1.202.86',\
            '216.4.58.18']

Normal for loop, which works fine:

asn=[]
asnid=[]
asncountry=[]
asndesc=[]
asnemail = []
asnaddress = []
asncity = []
asnstate = []
asnzip = []
asndesc2 = []
ipaddr=[]
b=1
totstolookup=len(iplookup)

for i in iplookup:
    i = str(i)
    print("Running #{} out of {}".format(b,totstolookup))
    try:
        obj=IPWhois(i,timeout=15)
        result=obj.lookup_whois()
        asn.append(result['asn'])
        asnid.append(result['asn_cidr'])
        asncountry.append(result['asn_country_code'])
        asndesc.append(result['asn_description'])
        try:
            asnemail.append(result['nets'][0]['emails'])
            asnaddress.append(result['nets'][0]['address'])
            asncity.append(result['nets'][0]['city'])
            asnstate.append(result['nets'][0]['state'])
            asnzip.append(result['nets'][0]['postal_code'])
            asndesc2.append(result['nets'][0]['description'])
            ipaddr.append(i)
        except:
            asnemail.append(0)
            asnaddress.append(0)
            asncity.append(0)
            asnstate.append(0)
            asnzip.append(0)
            asndesc2.append(0)
            ipaddr.append(i)
    except:
        pass
    b+=1

Function to pass to joblib to run on all cores:

def run_ip_process(iplookuparray):
    asn=[]
    asnid=[]
    asncountry=[]
    asndesc=[]
    asnemail = []
    asnaddress = []
    asncity = []
    asnstate = []
    asnzip = []
    asndesc2 = []
    ipaddr=[]
    b=1
    totstolookup=len(iplookuparray)

    for i in iplookuparray:
        i = str(i)
        print("Running #{} out of {}".format(b, totstolookup))
        try:
            obj = IPWhois(i, timeout=15)
            result = obj.lookup_whois()
            asn.append(result['asn'])
            asnid.append(result['asn_cidr'])
            asncountry.append(result['asn_country_code'])
            asndesc.append(result['asn_description'])
            try:
                asnemail.append(result['nets'][0]['emails'])
                asnaddress.append(result['nets'][0]['address'])
                asncity.append(result['nets'][0]['city'])
                asnstate.append(result['nets'][0]['state'])
                asnzip.append(result['nets'][0]['postal_code'])
                asndesc2.append(result['nets'][0]['description'])
                ipaddr.append(i)
            except:
                asnemail.append(0)
                asnaddress.append(0)
                asncity.append(0)
                asnstate.append(0)
                asnzip.append(0)
                asndesc2.append(0)
                ipaddr.append(i)
        except:
            pass
        b += 1

    ipdataframe = pd.DataFrame({'ipaddress': ipaddr,
                                'asn': asn,
                                'asnid': asnid,
                                'asncountry': asncountry,
                                'asndesc': asndesc,
                                'emailcontact': asnemail,
                                'address': asnaddress,
                                'city': asncity,
                                'state': asnstate,
                                'zip': asnzip,
                                'ipdescrip': asndesc2})

    return ipdataframe

Run the process using all cores via joblib:

Parallel(n_jobs=num_core)(delayed(run_ip_process)(iplookuparray) for i in iplookup)
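
As written, this call hands the entire iplookuparray to every job: the loop variable i is never used, so each worker repeats all of the lookups instead of splitting them across cores. The usual joblib pattern is one delayed call per item, with the results collected into a list. A minimal sketch of that pattern, assuming a hypothetical per-IP helper lookup_ip that returns one flat record per address (column names mirror the DataFrame above):

from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

num_core = multiprocessing.cpu_count()

def lookup_ip(ip):
    """Look up a single IP and return one flat record (0s on failure)."""
    record = {'ipaddress': ip, 'asn': 0, 'asnid': 0, 'asncountry': 0,
              'asndesc': 0, 'emailcontact': 0, 'address': 0, 'city': 0,
              'state': 0, 'zip': 0, 'ipdescrip': 0}
    try:
        result = IPWhois(ip, timeout=15).lookup_whois()
        record.update({'asn': result['asn'],
                       'asnid': result['asn_cidr'],
                       'asncountry': result['asn_country_code'],
                       'asndesc': result['asn_description']})
        net = result['nets'][0]
        record.update({'emailcontact': net.get('emails', 0),
                       'address': net.get('address', 0),
                       'city': net.get('city', 0),
                       'state': net.get('state', 0),
                       'zip': net.get('postal_code', 0),
                       'ipdescrip': net.get('description', 0)})
    except Exception:
        pass  # keep the zero-filled record, mirroring the bare excepts above
    return record

# One delayed call per IP, not one per copy of the whole list.
records = Parallel(n_jobs=num_core)(delayed(lookup_ip)(ip) for ip in iplookup)
ipdataframe = pd.DataFrame(records)

Returning one record per call also avoids keeping eleven parallel lists in sync, and pd.DataFrame assembles the result in a single step.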
  • If you are doing network communications, then the limiting factor will be network bandwidth, not CPU or the GIL. – Lawrence D'Oliveiro Nov 04 '17 at 02:24
  • I thought so at first as well, but my network is running at 100 Mbps up and down, consistently. – P.Cummings Nov 04 '17 at 18:05
  • That’s what I mean: if you are already saturating your link, then multithreading won’t help. – Lawrence D'Oliveiro Nov 05 '17 at 01:50
  • Thanks! I guess AWS could help me out with this. – P.Cummings Nov 05 '17 at 15:32
  • Are you sure the service provider has not implemented a safeguarding policy to throttle down any remote attempt to "attack" its service with a fast block of 100k requests? In other words, **you might have whatever accelerated computing + unlimited connectivity** (if you are willing to pay the respective costs) **and yet still have a 24-hour turnaround time, because the service provider does not permit massive numbers of service requests and throttles them down...** – user3666197 Nov 08 '17 at 07:29
  • It's very possible, something I didn't even think of. It seems more and more that this is pointing towards a network issue rather than a Python GIL issue. – P.Cummings Nov 08 '17 at 12:46
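
Following up on the comment thread: each lookup spends nearly all of its time blocked on the network, and CPython releases the GIL while a thread waits on I/O, so a thread pool is a lighter-weight alternative to process-based joblib for this workload. A sketch, reusing the hypothetical lookup_ip helper from above:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Threads parallelize I/O-bound WHOIS calls without process-spawning
# overhead. max_workers=8 is an arbitrary starting point; keep it modest,
# since hammering the WHOIS servers can trigger the rate limiting the
# comments warn about.
with ThreadPoolExecutor(max_workers=8) as pool:
    records = list(pool.map(lookup_ip, iplookup))

ipdataframe = pd.DataFrame(records)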

0 Answers