1

I have two dataframes in pandas.

import pandas as pd
inp1 = [{'network':'1.0.0.0/24', 'A':1, 'B':2}, {'network':'5.46.8.0/23', 'A':3, 'B':4}, {'network':'78.212.13.0/24', 'A':5, 'B':6}]
df1 = pd.DataFrame(inp)
print("df1", df1)

inp2 = [{'ip':'1.0.0.10'}, {'ip':'blahblahblah'}, {'ip':'78.212.13.249'}]
df2 = pd.DataFrame(inp2)
print("df2", df2)

Output:

          network  A  B
0      1.0.0.0/24  1  2
1     5.46.8.0/23  3  4
2  78.212.13.0/24  5  6
              ip
0       1.0.0.10
1   blahblahblah
2  78.212.13.249

The ultimate output I want would appear as follows:

         ip              A           B
0       1.0.0.10         1           2
1   blahblahblah         NaN         Nan
2  78.212.13.249         5           6

I want to iterate through each cell in df2['ip'] and check if it belongs to a network in df1['network']. If it belongs to a network, it would return the corresponding A and B column for the specific ip address. I have referenced this article and considered netaddr, IPNetwork, IPAddress, ipaddress but cannot quite figure it out.

Help appreciated!

mk2080
  • 872
  • 1
  • 8
  • 21
  • `df1.merge(df2, left_on='network', right_on='ip', how='right')`? – Quang Hoang Jan 08 '20 at 20:10
  • @QuangHoang please note that the columns 'Network' and 'ip' are not the same. Here is a link with more explanation: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing – mk2080 Jan 08 '20 at 20:21

2 Answers2

2

You can do it using netaddr + apply(). Here is an example:

from netaddr import IPNetwork, IPAddress, AddrFormatError


network_df = pd.DataFrame([
    {'network': '1.0.0.0/24', 'A': 1, 'B': 2},
    {'network': '5.46.8.0/23', 'A': 3, 'B': 4},
    {'network': '78.212.13.0/24', 'A': 5, 'B': 6}
])
ip_df = pd.DataFrame([{'ip': '1.0.0.10'}, {'ip': 'blahblahblah'}, {'ip': '78.212.13.249'}])
# create all networks using netaddr
networks = (IPNetwork(n) for n in network_df.network.to_list())


def find_network(ip):
    # return empty string when bad/wrong IP
    try:
        ip_address = IPAddress(ip)
    except AddrFormatError:
        return ''
    # return network name as string if we found network
    for network in networks:
        if ip_address in network:
            return str(network.cidr)
    return ''


# add network column. set network names by ip column
ip_df['network'] = ip_df['ip'].apply(find_network)
# just merge by network columns(str in both dataframes)
result = pd.merge(ip_df, network_df, how='left', on='network')
# you don't need network column in expected output...
result = result.drop(columns=['network'])
print(result)
#               ip    A    B
# 0       1.0.0.10  1.0  2.0
# 1   blahblahblah  NaN  NaN
# 2  78.212.13.249  5.0  6.0

See comments. Hope this helps.

Danila Ganchar
  • 10,266
  • 13
  • 49
  • 75
0

If you're willing to use R instead of Python, I've written an ipaddress package which can solve this problem. There's still an underlying loop, but it's implemented in C++ (much faster!)

library(tibble)
library(ipaddress)
library(fuzzyjoin)

addr <- tibble(
  address = ip_address(c("1.0.0.10", "blahblahblah", "78.212.13.249"))
)
#> Warning: Problem on row 2: blahblahblah
nets <- tibble(
  network = ip_network(c("1.0.0.0/24", "5.46.8.0/23", "78.212.13.0/24")),
  A = c(1, 3, 5),
  B = c(2, 4, 6)
)

fuzzy_left_join(addr, nets, c("address" = "network"), is_within)
#> # A tibble: 3 x 4
#>         address        network     A     B
#>       <ip_addr>     <ip_netwk> <dbl> <dbl>
#> 1      1.0.0.10     1.0.0.0/24     1     2
#> 2            NA             NA    NA    NA
#> 3 78.212.13.249 78.212.13.0/24     5     6

Created on 2020-09-02 by the reprex package (v0.3.0)

David Hall
  • 515
  • 6
  • 13