
Pandas masters, I have a question for you.

I am writing a function to convert IP addresses to geolocations. I use a CSV file from IP2LOCATION as the lookup table. The lookup table looks like this:

           ip_start      ip_end country_code              country_name  state_name    city_name
0                 0    16777215            -                         -           -            -
1          16777216    16777471           US  United States of America  California  Los Angeles
2          16777472    16778239           CN                     China      Fujian       Fuzhou
3          16778240    16779263           AU                 Australia    Victoria    Melbourne
4          16779264    16781311           CN                     China   Guangdong    Guangzhou
...             ...         ...          ...                       ...         ...          ...
2907906  3758094336  3758095359           HK                 Hong Kong   Hong Kong    Hong Kong
2907907  3758095360  3758095871           CN                     China      Fujian       Fuzhou
2907908  3758095872  3758096127           SG                 Singapore   Singapore    Singapore
2907909  3758096128  3758096383           AU                 Australia  Queensland     Brisbane
2907910  3758096384  4294967295            -                         -           -            -

And I have 1.4M IP addresses in a DataFrame. Now I want to traverse it and look up the city and country information for each address. I did the following tests.
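(In case it matters: the ip_value column holds each IPv4 address converted to its 32-bit integer form, which is what the IP2LOCATION ranges use. A minimal conversion sketch with the standard ipaddress module, assuming the raw addresses are dotted-quad strings in an ip column:)

```python
import ipaddress

import pandas as pd

def ip_to_int(ip_str):
    # Dotted-quad IPv4 string -> 32-bit integer, the format the IP2LOCATION ranges use
    return int(ipaddress.IPv4Address(ip_str))

df = pd.DataFrame({'ip': ['1.0.0.1', '223.255.255.1']})
df['ip_value'] = df['ip'].apply(ip_to_int)
```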

Test1

def get_city(ip_value, ip_lookup=ip_lookup):
    # IP2LOCATION ranges are inclusive on both ends, so use <= for ip_end
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df.loc[:10000, 'ip_value'].apply(get_city)

It took me 2 minutes 30 seconds to look up only 10,000 IPs.

Test2

country_names = []
state_names = []
city_names = []

for _, ip_value in df.loc[:10000, 'ip_value'].items():  # iteritems() is deprecated; use items()
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    country_names.append(match['country_name'].item())
    state_names.append(match['state_name'].item())
    city_names.append(match['city_name'].item())

This code runs at almost the same speed as the previous one.

Test3

In this test, I set 'ip_start' and 'ip_end' as the index of the 'ip_lookup' DataFrame.

ip_lookup.set_index(['ip_start', 'ip_end'], inplace=True)
ip_lookup.loc[(ip_lookup.index.get_level_values('ip_start') <= 1709327025) & (ip_lookup.index.get_level_values('ip_end') >= 1709327025)]

This code works, but it is much slower than the previous two: it took 65 ms to look up a single IP.

Test4

In this test I use Dask.

import dask.dataframe as dd

df1 = dd.from_pandas(df.loc[:10000, 'ip_value'], npartitions=20)

def get_city(ip_value, ip_lookup=ip_lookup):
    # Same inclusive-range lookup as in Test1
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df1 = df1.apply(get_city, meta={'country': 'object', 'state': 'object', 'city': 'object'})
df1.compute()

This code is a little faster: it took 55 seconds to execute.

I need your help: I have 1.4M IP addresses to look up and really need to speed this up. Any ideas? Thanks.

Steven Li
  • Not sure if you've seen this, but it may help you - https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758 – Danail Petrov Jan 01 '21 at 10:58

1 Answer


It seems you want to find the matching row faster than a full DataFrame broadcast. Try bisect.bisect_left: once you have the index, you can use df.iloc to get the geolocation. In your case, you only need to search on ip_start and take the index to the left.

import bisect
import pandas as pd
import numpy as np

lst = np.sort(np.random.randint(100000, size=10000))
%time bisect.bisect_left(lst, 50)
CPU times: user 22 µs, sys: 0 ns, total: 22 µs
Wall time: 25.3 µs

df = pd.DataFrame(lst, columns=['a'])
%time df[(df.a > 30) & (df.a < 60)]
CPU times: user 2.03 ms, sys: 881 µs, total: 2.91 ms
Wall time: 2.51 ms
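Applied to the table from the question, the same idea can be vectorised with np.searchsorted, which is effectively bisect_left/right over a whole array at once. A sketch with a toy lookup table; it assumes the ranges are sorted by ip_start and contiguous (as they are in IP2LOCATION files), so searching on ip_start alone is enough:

```python
import numpy as np
import pandas as pd

# Toy lookup table in the same shape as the IP2LOCATION ranges in the question
ip_lookup = pd.DataFrame({
    'ip_start':     [0,        16777216, 16777472, 16778240],
    'ip_end':       [16777215, 16777471, 16778239, 16779263],
    'country_name': ['-', 'United States of America', 'China', 'Australia'],
    'state_name':   ['-', 'California', 'Fujian', 'Victoria'],
    'city_name':    ['-', 'Los Angeles', 'Fuzhou', 'Melbourne'],
})

df = pd.DataFrame({'ip_value': [16777300, 16778500]})

# side='right' - 1 gives, for each IP, the row whose ip_start is the
# largest value <= that IP -- a vectorised bisect over all IPs at once
idx = np.searchsorted(ip_lookup['ip_start'].values,
                      df['ip_value'].values, side='right') - 1
geo = ip_lookup.iloc[idx][['country_name', 'state_name', 'city_name']]
result = pd.concat([df, geo.reset_index(drop=True)], axis=1)
```

Since there is one searchsorted call instead of 1.4M boolean masks, this should scale to the full DataFrame.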
galaxyan