
Pandas masters, I have a question for you.

I am writing a function to convert IP addresses to geolocations. I use a CSV file from IP2LOCATION as the lookup table. The lookup table looks like this:

           ip_start      ip_end country_code              country_name  state_name    city_name
0                 0    16777215            -                         -           -            -
1          16777216    16777471           US  United States of America  California  Los Angeles
2          16777472    16778239           CN                     China      Fujian       Fuzhou
3          16778240    16779263           AU                 Australia    Victoria    Melbourne
4          16779264    16781311           CN                     China   Guangdong    Guangzhou
...             ...         ...          ...                       ...         ...          ...
2907906  3758094336  3758095359           HK                 Hong Kong   Hong Kong    Hong Kong
2907907  3758095360  3758095871           CN                     China      Fujian       Fuzhou
2907908  3758095872  3758096127           SG                 Singapore   Singapore    Singapore
2907909  3758096128  3758096383           AU                 Australia  Queensland     Brisbane
2907910  3758096384  4294967295            -                         -           -            -

And I have 1.4M IP addresses in a DataFrame. Now I want to traverse it and look up the city and country information for each address. I did the following tests.
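(In case it matters: the ip_value column holds each IPv4 address converted to its 32-bit integer form, which is what the IP2LOCATION ranges use. A minimal conversion sketch with the standard ipaddress module, assuming the raw addresses are dotted-quad strings in an ip column:)

```python
import ipaddress

import pandas as pd

def ip_to_int(ip_str):
    # Dotted-quad IPv4 string -> 32-bit integer, the format the IP2LOCATION ranges use
    return int(ipaddress.IPv4Address(ip_str))

df = pd.DataFrame({'ip': ['1.0.0.1', '223.255.255.1']})
df['ip_value'] = df['ip'].apply(ip_to_int)
```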

Test1

def get_city(ip_value, ip_lookup=ip_lookup):
    # IP2LOCATION ranges are inclusive on both ends, so use <= for ip_end
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df.loc[:10000, 'ip_value'].apply(get_city)

It took me 2 minutes 30 seconds to look up only 10,000 IPs.

Test2

country_names = []
state_names = []
city_names = []

for _, ip_value in df.loc[:10000, 'ip_value'].items():  # iteritems() is deprecated; use items()
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    country_names.append(match['country_name'].item())
    state_names.append(match['state_name'].item())
    city_names.append(match['city_name'].item())

This code runs at almost the same speed as the previous one.

Test3

In this test, I set 'ip_start' and 'ip_end' as the index of the 'ip_lookup' DataFrame.

ip_lookup.set_index(['ip_start', 'ip_end'], inplace=True)
ip_lookup.loc[(ip_lookup.index.get_level_values('ip_start') <= 1709327025) & (ip_lookup.index.get_level_values('ip_end') >= 1709327025)]

This code works, but it is much slower than the previous two: it took 65 ms to look up a single IP.

Test4

In this test I use Dask.

import dask.dataframe as dd

df1 = dd.from_pandas(df.loc[:10000, 'ip_value'], npartitions=20)

def get_city(ip_value, ip_lookup=ip_lookup):
    # Same inclusive-range lookup as in Test1
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value <= ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df1 = df1.apply(get_city, meta={'country': 'object', 'state': 'object', 'city': 'object'})
df1.compute()

This code is a little faster: it took 55 seconds to execute.

I need your help: I have 1.4M IP addresses to look up and really need to speed this up. Any ideas? Thanks.

Steven Li
  • Not sure if you've seen this, but it may help you - https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758 – Danail Petrov Jan 01 '21 at 10:58

1 Answer


It seems you want to find the matching row faster than a full DataFrame broadcast. Try bisect.bisect_left: once you have the index, you can use df.iloc to get the geolocation. In your case, you only need to search on ip_start and take the index to the left.

import bisect
import pandas as pd
import numpy as np

lst = np.sort(np.random.randint(100000, size=10000))
%time bisect.bisect_left(lst, 50)
CPU times: user 22 µs, sys: 0 ns, total: 22 µs
Wall time: 25.3 µs

df = pd.DataFrame(lst, columns=['a'])
%time df[(df.a > 30) & (df.a < 60)]
CPU times: user 2.03 ms, sys: 881 µs, total: 2.91 ms
Wall time: 2.51 ms
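Applied to the table from the question, the same idea can be vectorised with np.searchsorted, which is effectively bisect_left/right over a whole array at once. A sketch with a toy lookup table; it assumes the ranges are sorted by ip_start and contiguous (as they are in IP2LOCATION files), so searching on ip_start alone is enough:

```python
import numpy as np
import pandas as pd

# Toy lookup table in the same shape as the IP2LOCATION ranges in the question
ip_lookup = pd.DataFrame({
    'ip_start':     [0,        16777216, 16777472, 16778240],
    'ip_end':       [16777215, 16777471, 16778239, 16779263],
    'country_name': ['-', 'United States of America', 'China', 'Australia'],
    'state_name':   ['-', 'California', 'Fujian', 'Victoria'],
    'city_name':    ['-', 'Los Angeles', 'Fuzhou', 'Melbourne'],
})

df = pd.DataFrame({'ip_value': [16777300, 16778500]})

# side='right' - 1 gives, for each IP, the row whose ip_start is the
# largest value <= that IP -- a vectorised bisect over all IPs at once
idx = np.searchsorted(ip_lookup['ip_start'].values,
                      df['ip_value'].values, side='right') - 1
geo = ip_lookup.iloc[idx][['country_name', 'state_name', 'city_name']]
result = pd.concat([df, geo.reset_index(drop=True)], axis=1)
```

Since there is one searchsorted call instead of 1.4M boolean masks, this should scale to the full DataFrame.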
galaxyan