
I have a process that loops over a list of IP addresses and returns some information about each one. The simple for loop works great; my issue is running this at scale, given Python's Global Interpreter Lock (GIL).

My goal is to run this function in parallel and make full use of my 4 cores, so that running 100K of these doesn't take 24 hours the way a normal for loop would.

After reading other answers on here, particularly this one, How do I parallelize a simple Python loop?, I decided to use joblib. When I ran 10 records through it (the example below), it took over 10 minutes. That doesn't sound right, so I know there is something I'm doing wrong or not understanding. Any help is greatly appreciated!

import pandas as pd
import numpy as np
import os
from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing

num_core = multiprocessing.cpu_count()

iplookup = ['174.192.22.197',\
            '70.197.71.201',\
            '174.195.146.248',\
            '70.197.15.130',\
            '174.208.14.133',\
            '174.238.132.139',\
            '174.204.16.10',\
            '104.132.11.82',\
            '24.1.202.86',\
            '216.4.58.18']

Normal for loop, which works fine:

asn=[]
asnid=[]
asncountry=[]
asndesc=[]
asnemail = []
asnaddress = []
asncity = []
asnstate = []
asnzip = []
asndesc2 = []
ipaddr=[]
b=1
totstolookup=len(iplookup)

for i in iplookup:
    i = str(i)
    print("Running #{} out of {}".format(b,totstolookup))
    try:
        obj=IPWhois(i,timeout=15)
        result=obj.lookup_whois()
        asn.append(result['asn'])
        asnid.append(result['asn_cidr'])
        asncountry.append(result['asn_country_code'])
        asndesc.append(result['asn_description'])
        try:
            asnemail.append(result['nets'][0]['emails'])
            asnaddress.append(result['nets'][0]['address'])
            asncity.append(result['nets'][0]['city'])
            asnstate.append(result['nets'][0]['state'])
            asnzip.append(result['nets'][0]['postal_code'])
            asndesc2.append(result['nets'][0]['description'])
            ipaddr.append(i)
        except:
            asnemail.append(0)
            asnaddress.append(0)
            asncity.append(0)
            asnstate.append(0)
            asnzip.append(0)
            asndesc2.append(0)
            ipaddr.append(i)
    except:
        pass
    b+=1

Function to pass to joblib to run on all cores:

def run_ip_process(iplookuparray):
    asn=[]
    asnid=[]
    asncountry=[]
    asndesc=[]
    asnemail = []
    asnaddress = []
    asncity = []
    asnstate = []
    asnzip = []
    asndesc2 = []
    ipaddr=[]
    b=1
    totstolookup=len(iplookuparray)

    for i in iplookuparray:
        i = str(i)
        print("Running #{} out of {}".format(b, totstolookup))
        try:
            obj = IPWhois(i, timeout=15)
            result = obj.lookup_whois()
            asn.append(result['asn'])
            asnid.append(result['asn_cidr'])
            asncountry.append(result['asn_country_code'])
            asndesc.append(result['asn_description'])
            try:
                asnemail.append(result['nets'][0]['emails'])
                asnaddress.append(result['nets'][0]['address'])
                asncity.append(result['nets'][0]['city'])
                asnstate.append(result['nets'][0]['state'])
                asnzip.append(result['nets'][0]['postal_code'])
                asndesc2.append(result['nets'][0]['description'])
                ipaddr.append(i)
            except:
                asnemail.append(0)
                asnaddress.append(0)
                asncity.append(0)
                asnstate.append(0)
                asnzip.append(0)
                asndesc2.append(0)
                ipaddr.append(i)
        except:
            pass
        b += 1

    ipdataframe = pd.DataFrame({'ipaddress': ipaddr,
                                'asn': asn,
                                'asnid': asnid,
                                'asncountry': asncountry,
                                'asndesc': asndesc,
                                'emailcontact': asnemail,
                                'address': asnaddress,
                                'city': asncity,
                                'state': asnstate,
                                'zip': asnzip,
                                'ipdescrip': asndesc2})

    return ipdataframe

Run the process using all cores via joblib:

Parallel(n_jobs=num_core)(delayed(run_ip_process)(iplookuparray) for i in iplookup)
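
As written, this call hands the entire iplookuparray to every job: the loop variable i is never used, so each worker repeats all of the lookups instead of splitting them across cores. The usual joblib pattern is one delayed call per item, with the results collected into a list. A minimal sketch of that pattern, assuming a hypothetical per-IP helper lookup_ip that returns one flat record per address (column names mirror the DataFrame above):

from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

num_core = multiprocessing.cpu_count()

def lookup_ip(ip):
    """Look up a single IP and return one flat record (0s on failure)."""
    record = {'ipaddress': ip, 'asn': 0, 'asnid': 0, 'asncountry': 0,
              'asndesc': 0, 'emailcontact': 0, 'address': 0, 'city': 0,
              'state': 0, 'zip': 0, 'ipdescrip': 0}
    try:
        result = IPWhois(ip, timeout=15).lookup_whois()
        record.update({'asn': result['asn'],
                       'asnid': result['asn_cidr'],
                       'asncountry': result['asn_country_code'],
                       'asndesc': result['asn_description']})
        net = result['nets'][0]
        record.update({'emailcontact': net.get('emails', 0),
                       'address': net.get('address', 0),
                       'city': net.get('city', 0),
                       'state': net.get('state', 0),
                       'zip': net.get('postal_code', 0),
                       'ipdescrip': net.get('description', 0)})
    except Exception:
        pass  # keep the zero-filled record, mirroring the bare excepts above
    return record

# One delayed call per IP, not one per copy of the whole list.
records = Parallel(n_jobs=num_core)(delayed(lookup_ip)(ip) for ip in iplookup)
ipdataframe = pd.DataFrame(records)

Returning one record per call also avoids keeping eleven parallel lists in sync, and pd.DataFrame assembles the result in a single step.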
  • If you are doing network communications, then the limiting factor will be network bandwidth, not CPU or the GIL. – Lawrence D'Oliveiro Nov 04 '17 at 02:24
  • I thought so at first as well, but my network is running at 100 Mbps up and down, consistently. – P.Cummings Nov 04 '17 at 18:05
  • That’s what I mean: if you are already saturating your link, then multithreading won’t help. – Lawrence D'Oliveiro Nov 05 '17 at 01:50
  • Thanks! I guess AWS could help me out with this. – P.Cummings Nov 05 '17 at 15:32
  • Are you sure the service provider has not implemented a safeguarding policy to throttle down any remote attempt to "attack" its service with a fast block of 100k requests? In other words, **you might have whatever accelerated computing + unlimited connectivity** (if you are willing to pay the respective costs) **and yet still have a 24-hour turnaround time, because the service provider does not permit massive numbers of service requests and throttles them down...** – user3666197 Nov 08 '17 at 07:29
  • It's very possible, something I didn't even think of. It seems more and more that this is pointing towards a network issue rather than a Python GIL issue. – P.Cummings Nov 08 '17 at 12:46
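
Following up on the comment thread: each lookup spends nearly all of its time blocked on the network, and CPython releases the GIL while a thread waits on I/O, so a thread pool is a lighter-weight alternative to process-based joblib for this workload. A sketch, reusing the hypothetical lookup_ip helper from above:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Threads parallelize I/O-bound WHOIS calls without process-spawning
# overhead. max_workers=8 is an arbitrary starting point; keep it modest,
# since hammering the WHOIS servers can trigger the rate limiting the
# comments warn about.
with ThreadPoolExecutor(max_workers=8) as pool:
    records = list(pool.map(lookup_ip, iplookup))

ipdataframe = pd.DataFrame(records)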

0 Answers