
I have a column of IP addresses in a pandas DataFrame (df). Each address is sent in a GET request to the ARIN Whois REST API using requests, and I want the organization or customer for that IP address. I am using a requests Session() inside of a requests-futures FuturesSession() to hopefully speed up the API calls. Here is the code:

import requests
from requests_futures.sessions import FuturesSession

s = requests.Session()
session = FuturesSession(session=s, max_workers=10)

def getIPAddressOrganization(IP_Address):
    url = 'https://whois.arin.net/rest/ip/' + IP_Address + '.json'
    request = session.get(url)
    response = request.result().json()
    try:
        organization = response['net']['orgRef']['@name']
    except KeyError:
        organization = response['net']['customerRef']['@name']
    return organization

df['organization'] = df['IP'].apply(getIPAddressOrganization)

Adding the regular requests Session() helped performance a lot, but the requests-futures FuturesSession() has not helped (likely due to my lack of knowledge).

How should pandas apply() be used in tandem with requests-futures, and/or is there another option for speeding up API calls that could be more effective?

OverflowingTheGlass
  • When I need to download resources from a pandas dataframe, I'll use a column method, like df['column'].tolist() and set that equal to a variable, then I'll use threads or multiprocessing to efficiently make the requests and then map the result back into the dataframe. – jrjames83 Jan 18 '18 at 16:54
  • sounds like a decent option - could you please expound upon that? – OverflowingTheGlass Jan 18 '18 at 17:11
  • Right, I just meant: take the pandas column with all your org IPs and put it in a list. Process the list using requests-futures, or the multiprocessing module using a strategy like this https://stackoverflow.com/questions/8640367/python-manager-dict-in-multiprocessing, incorporating a process-safe dict that maps each IP address to the output of your JSON parsing function. Then in pandas: df['result'] = df['ip'].map(lambda x: mydict.get(x, None)) – jrjames83 Jan 18 '18 at 17:49
  • so it's not possible to multiprocess a df column? in other words, the requests futures really isn't doing anything in my current code? – OverflowingTheGlass Jan 18 '18 at 18:22
  • That's my thinking - pandas may not move to the next row until a value has been returned from the function called. – jrjames83 Jan 18 '18 at 18:45
  • An easy way to verify would be to write a stupid function that sleeps for 5 seconds and returns True, then run for a dataframe with 10 rows and see if it takes 50 seconds, etc... – jrjames83 Jan 18 '18 at 19:10
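
The approach described in the comments above (pull the column into a list, run the lookups concurrently, then map the results back) could look roughly like the sketch below. It uses the standard library's ThreadPoolExecutor rather than requests-futures or multiprocessing, which is a substitution on my part. The names lookup_all and fetch are hypothetical; fetch stands for whatever function performs one lookup and returns the organization name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def lookup_all(ips: Iterable[str],
               fetch: Callable[[str], str],
               max_workers: int = 10) -> Dict[str, Optional[str]]:
    """Call fetch(ip) for each distinct IP on a thread pool; return {ip: result}.

    `fetch` is a hypothetical stand-in for the function that makes one request.
    """
    # De-duplicate while keeping first-seen order, so each IP is requested once.
    unique_ips = list(dict.fromkeys(ips))

    def safe_fetch(ip: str) -> Optional[str]:
        try:
            return fetch(ip)
        except Exception:
            return None  # a failed lookup is recorded as None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as unique_ips.
        results = list(pool.map(safe_fetch, unique_ips))

    return dict(zip(unique_ips, results))
```

With the question's names, this would be used as mapping = lookup_all(df['IP'].tolist(), some_lookup_function) followed by df['organization'] = df['IP'].map(mapping). Because the work is I/O-bound, threads are usually enough here and avoid the need for a process-safe dict.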

1 Answer


This does not directly answer the question, but it shows that pandas' apply() function does indeed wait for the result of each API call and does not parallelize or optimize for IO time:

import time
import pandas as pd


df = pd.DataFrame(data=range(10))
start = time.perf_counter()
df.apply(lambda r: time.sleep(5), axis=1)
end = time.perf_counter() - start

print(f'total time: {end}')

total time: 50.05315346799034
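
For contrast, here is a small sketch of the same kind of experiment where the blocking calls are handed to a thread pool instead of being run one after another. It is only an illustration under assumptions: slow_call and timed are hypothetical helpers, and the sleep is shortened to 0.2 seconds so the comparison finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(_):
    """Hypothetical stand-in for one blocking API call."""
    time.sleep(0.2)
    return True


def timed(fn):
    """Run fn() and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# One call after another: roughly 10 * 0.2 seconds in total.
sequential = timed(lambda: [slow_call(i) for i in range(10)])

# All ten calls wait at the same time on separate threads.
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent = timed(lambda: list(pool.map(slow_call, range(10))))

print(f'sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s')
```

The sequential run should take about ten times as long as the pooled run, since the time is spent waiting rather than computing.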

Conclusion - perhaps it's best to consider an async IO approach

A tentative direction:

import asyncio
from typing import List

import aiohttp


async def parallel_rest_calls(data: List[str]):
    # One shared ClientSession reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for ip in data:
            tasks.append(getIPAddressOrganization(session, ip))

        # With return_exceptions=True, a failed lookup appears in the results
        # as an exception object instead of being raised out of gather().
        enriched_data_col = await asyncio.gather(*tasks, return_exceptions=True)
        return enriched_data_col


async def getIPAddressOrganization(session: aiohttp.ClientSession, IP_Address):
    url = 'https://whois.arin.net/rest/ip/' + IP_Address + '.json'
    async with session.get(url) as response:
        # Raise on 4xx/5xx so that gather() records the failure for this IP.
        response.raise_for_status()
        payload = await response.json()

        try:
            organization = payload['net']['orgRef']['@name']
        except KeyError:
            organization = payload['net']['customerRef']['@name']
        return (IP_Address, organization)
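
To get the gathered results back into the dataframe, one option is to turn the list of (IP, organization) tuples into a dict and map it onto the column. The sketch below is an assumption about how that step might look, not part of the code above: results_to_mapping is a hypothetical helper, and fake_lookup is a stand-in coroutine so the example runs without network access. Since gather() is called with return_exceptions=True, failed lookups come back as exception objects and have to be skipped.

```python
import asyncio
from typing import Dict, List


def results_to_mapping(results: List[object]) -> Dict[str, str]:
    """Keep the (ip, organization) tuples; skip exceptions returned by gather()."""
    mapping = {}
    for item in results:
        if isinstance(item, BaseException):
            continue  # this lookup failed; leave the IP out of the mapping
        ip, organization = item
        mapping[ip] = organization
    return mapping


async def fake_lookup(ip: str):
    """Hypothetical stand-in for the real per-IP request coroutine."""
    await asyncio.sleep(0)
    if ip == 'bad':
        raise KeyError(ip)
    return (ip, 'org-' + ip)


async def gather_all(ips: List[str]):
    return await asyncio.gather(*(fake_lookup(ip) for ip in ips),
                                return_exceptions=True)


results = asyncio.run(gather_all(['1.1.1.1', 'bad', '8.8.8.8']))
mapping = results_to_mapping(results)
print(mapping)
```

With the real coroutine, the last step would be along the lines of df['organization'] = df['IP'].map(mapping); IPs whose lookup failed are absent from the dict and end up as NaN.
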
Joey Baruch