4

I have a pandas dataframe of users and their ip addresses:

users_df = pd.DataFrame({'id': [1,2,3],
                         'ip': ['96.255.18.236','105.49.228.135','104.236.210.234']})

   id               ip
0   1    96.255.18.236
1   2   105.49.228.135
2   3  104.236.210.234

And a separate dataframe containing network ranges and corresponding geoname IDs:

geonames_df = pd.DataFrame({'network': ['96.255.18.0/24','105.49.224.0/19','104.236.128.0/17'],
                            'geoname': ['4360369.0','192950.0','5391959.0']})

     geoname           network
0  4360369.0    96.255.18.0/24
1   192950.0   105.49.224.0/19
2  5391959.0  104.236.128.0/17

For each user, I need to check their ip against all networks, and pull out the corresponding geoname and add it to users_df. I want this as output:

   id               ip   geonames
0   1    96.255.18.236  4360369.0
1   2   105.49.228.135   192950.0
2   3  104.236.210.234  5391959.0

In this example its easy because they're correctly ordered and only 3 examples. In reality, users_df has 4000 rows, and geonames_df has over 3 million

I'm currently using this:

import ipaddress

networks = []
for n in geonames_df['network']:
    networks.append(ipaddress.ip_network(n))

geonames = []

for idx, row in users_df.iterrows():
    ip_address = ipaddress.IPv4Address(row['ip'])

    for block in networks:
        if ip_address in block:
            geonames.append(str(geonames_df.loc[geonames_df['network'] == str(block), 'geoname'].item()))
            break

users_df['geonames'] = geonames

This is very slow because of the nested loop over the dataframe/list. Is there a faster way that leverages numpy/pandas? Or at least some way that is faster than the method above?

Theres a similar question about this (How can I check if an ip is in a network in python 2.x?), but 1) it doesnt involve pandas/numpy, 2) I want to check multiple IPs against multiple networks, and 3) the highest voted answer cant avoid a nested loop, which is where my slow performance stems from

Simon
  • 9,762
  • 15
  • 62
  • 119
  • I think the issue you are going to run into here is using the pandas functions on that inner for loop. If you could get it into a form that is more low-level then you could utilize Numba, Cython, etc. I don't see a way to speed up the outer for loop though, because you are calling a third party library. – pythonweb Sep 04 '18 at 05:07
  • How does `105.49.228.135` map to `105.49.224.0/19` range? – Zero Sep 04 '18 at 05:08

2 Answers2

1

I don't think that nested loop can be avoided but I have combined previous solution, mentioned in the comment, with pandas. You can check if it is any faster.

import socket,struct

def makeMask(n):
    "return a mask of n bits as a long integer"
    return (2<<n-1) - 1

def dottedQuadToNum(ip):
    "convert decimal dotted quad string to long integer"
    return struct.unpack('L',socket.inet_aton(ip))[0]

def networkMask(network):
    "Convert a network address to a long integer" 
    return dottedQuadToNum(network.split('/')[0]) & makeMask(int(network.split('/')[1]))

def whichNetwork(ip):
    "return the network to which the ip belongs"
    numIp = dottedQuadToNum(ip)
    for index,aRow in geonames_df.iterrows():
        if (numIp & aRow["Net"] == aRow["Net"]):
            return aRow["geoname"]
    return "Not Found"

geonames_df["Net"] = geonames_df["network"].map(networkMask)
users_df["geonames"] = users_df["ip"].map(whichNetwork)
Ajay Srivastava
  • 1,151
  • 11
  • 15
0

If you're willing to use R instead of Python, I've written an ipaddress package which can solve this problem. There's still an underlying loop, but it's implemented in C++ (much faster!)

library(tibble)
library(ipaddress)
library(fuzzyjoin)

addr <- tibble(
  id = 1:3,
  address = ip_address(c("96.255.18.236", "105.49.228.135", "104.236.210.234"))
)
nets <- tibble(
  network = ip_network(c("96.255.18.0/24", "105.49.224.0/19", "104.236.128.0/17")),
  geoname = c("4360369.0", "192950.0", "5391959.0")
)

fuzzy_left_join(addr, nets, c("address" = "network"), is_within)
#> # A tibble: 3 x 4
#>      id         address          network geoname  
#>   <int>       <ip_addr>       <ip_netwk> <chr>    
#> 1     1   96.255.18.236   96.255.18.0/24 4360369.0
#> 2     2  105.49.228.135  105.49.224.0/19 192950.0 
#> 3     3 104.236.210.234 104.236.128.0/17 5391959.0

Created on 2020-09-02 by the reprex package (v0.3.0)

David Hall
  • 515
  • 6
  • 13