Pandas masters, I have a question for you.
I am writing a function to convert IP addresses to geolocations, using a CSV file from IP2LOCATION as the lookup table. The lookup table looks like this:
|  | ip_start | ip_end | country_code | country_name | state_name | city_name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 16777215 | - | - | - | - |
| 1 | 16777216 | 16777471 | US | United States of America | California | Los Angeles |
| 2 | 16777472 | 16778239 | CN | China | Fujian | Fuzhou |
| 3 | 16778240 | 16779263 | AU | Australia | Victoria | Melbourne |
| 4 | 16779264 | 16781311 | CN | China | Guangdong | Guangzhou |
| ... | ... | ... | ... | ... | ... | ... |
| 2907906 | 3758094336 | 3758095359 | HK | Hong Kong | Hong Kong | Hong Kong |
| 2907907 | 3758095360 | 3758095871 | CN | China | Fujian | Fuzhou |
| 2907908 | 3758095872 | 3758096127 | SG | Singapore | Singapore | Singapore |
| 2907909 | 3758096128 | 3758096383 | AU | Australia | Queensland | Brisbane |
| 2907910 | 3758096384 | 4294967295 | - | - | - | - |
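For reference, a minimal sketch of loading such a CSV into `ip_lookup` (the file name is a placeholder; IP2LOCATION CSV exports typically ship without a header row, so the column names used in this question are passed explicitly):

```python
import pandas as pd

# Sketch: load the IP2LOCATION CSV into the lookup table.
ip_lookup = pd.read_csv(
    'IP2LOCATION-LITE-DB3.CSV',  # placeholder file name
    names=['ip_start', 'ip_end', 'country_code',
           'country_name', 'state_name', 'city_name'],
)
```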
I have 1.4M IP addresses in a DataFrame, and I want to traverse it and look up the country, state, and city for each one. I ran the following tests.
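For context, `ip_value` is assumed to be each IPv4 address converted to its 32-bit integer form, which is what the `ip_start`/`ip_end` ranges use. A minimal sketch of that conversion (the source column name `ip` is hypothetical):

```python
import socket
import struct

def ip_to_int(ip_str):
    # Convert a dotted IPv4 string, e.g. '1.0.0.1', to an unsigned 32-bit integer.
    return struct.unpack('!I', socket.inet_aton(ip_str))[0]

# Hypothetical preprocessing step; df already has 'ip_value' in the tests below.
df['ip_value'] = df['ip'].map(ip_to_int)
```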
Test1
def get_city(ip_value, ip_lookup=ip_lookup):
    # Boolean mask over the full ~2.9M-row lookup table for every single IP
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value < ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df.loc[:10000, 'ip_value'].apply(get_city)
It took me 2 minutes and 30 seconds to look up only 10,000 IPs.
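The per-IP results from `apply` come back as a three-column DataFrame that can be joined onto the original rows, for example (a sketch; `geo` and `df_geo` are illustrative names):

```python
# Sketch: attach the looked-up columns back to the original rows.
geo = df.loc[:10000, 'ip_value'].apply(get_city)
df_geo = df.loc[:10000].join(geo)
```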
Test2
country_names = []
state_names = []
city_names = []
for _, ip_value in df.loc[:10000, 'ip_value'].items():
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value < ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    country_names.append(match['country_name'].item())
    state_names.append(match['state_name'].item())
    city_names.append(match['city_name'].item())
This code runs at almost the same speed as the previous one.
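The three lists can then be assembled into a result frame, for example (a sketch; `geo` is an illustrative name):

```python
# Sketch: build the result frame from the collected lists, keeping the
# index of the rows that were looked up.
geo = pd.DataFrame(
    {'country': country_names, 'state': state_names, 'city': city_names},
    index=df.loc[:10000].index,
)
```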
Test3
In this test, I set 'ip_start' and 'ip_end' as the index of the 'ip_lookup' DataFrame.
ip_lookup.set_index(['ip_start', 'ip_end'], inplace=True)
ip_lookup.loc[(ip_lookup.index.get_level_values('ip_start') <= 1709327025) & (ip_lookup.index.get_level_values('ip_end') > 1709327025)]
This code works, but it is much slower than the previous two approaches: it took 65 ms to look up a single IP.
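For completeness, this is roughly how the indexed lookup would be wrapped per IP (a sketch; the name `get_city_indexed` is illustrative). Both `get_level_values` calls still produce full-length boolean masks on every call, which likely explains why it is not faster:

```python
def get_city_indexed(ip_value, ip_lookup=ip_lookup):
    # ip_lookup now has a MultiIndex of (ip_start, ip_end); the two masks
    # below still scan the whole ~2.9M-entry index for each lookup.
    starts = ip_lookup.index.get_level_values('ip_start')
    ends = ip_lookup.index.get_level_values('ip_end')
    match = ip_lookup.loc[(starts <= ip_value) & (ends > ip_value)]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})
```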
Test4
In this test I use Dask.
import dask.dataframe as dd

df1 = dd.from_pandas(df.loc[:10000, 'ip_value'], npartitions=20)

def get_city(ip_value, ip_lookup=ip_lookup):
    # Same lookup function as in Test1
    cond = (ip_value >= ip_lookup['ip_start']) & (ip_value < ip_lookup['ip_end'])
    match = ip_lookup.loc[cond]
    return pd.Series({'country': match['country_name'].item(),
                      'state': match['state_name'].item(),
                      'city': match['city_name'].item()})

df1 = df1.apply(get_city, meta={'country': 'object', 'state': 'object', 'city': 'object'})
df1.compute()
This code is a little bit faster. It took 55 seconds to execute.
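For reference, the same computation can also be expressed with `map_partitions` (a sketch; `ddf` and `result` are illustrative names). Each partition is a plain pandas Series, so the row-by-row `get_city` scan still runs inside every partition; the partitions can merely be processed in parallel:

```python
# Sketch: same per-IP lookup, spread over partitions with map_partitions.
ddf = dd.from_pandas(df.loc[:10000, 'ip_value'], npartitions=20)
result = ddf.map_partitions(
    lambda part: part.apply(get_city),
    meta={'country': 'object', 'state': 'object', 'city': 'object'},
).compute()
```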
I need your help: I have 1.4M IP addresses to look up, and I really need to accelerate this. Any ideas? Thanks.