3

I have a set of about 200,000 IP Addresses and 10,000 subnets of the form(1.1.1.1/24). For every IP Address I need to check whether it belongs to one of these subnets, but since it is a such a large dataset and I have less computational power, I would like an efficient implementation for this.

On searching, one method I found was this (https://stackoverflow.com/a/820124/7995937):

from netaddr import IPNetwork, IPAddress
if IPAddress("192.168.0.1") in IPNetwork("192.168.0.0/24"):
     print "Yay!"

But since I have to loop this over 200,000 IP Addresses, and for each address loop over 10,000 subnets, I am unsure if this is efficient. My first doubt, is checking "IPAddress() in IPNetwork()" just a linear scan or is it optimized in some way?

The other solution I came up with was to make a list with all the IPs contained in the IP Subnets(which comes to about 13,000,000 IPs without duplicates), and then sorting it. If I do this, then in my loop over the 200,000 IP Addresses I only need to do a binary search for each IP, over a larger set of IP Addresses.

for ipMasked in ipsubnets:  # Here ipsubnets is the list of all subnets
        setUnmaskedIPs = [str(ip) for ip in IPNetwork(ipMasked)]
        ip_list = ip_list + setUnmaskedIPs
ip_list = list(set(ip_list))  # To eliminate duplicates
ip_list.sort()

I could then just perform binary search in the following manner:

for ip in myIPList:  # myIPList is the list of 200,000 IPs
    if bin_search(ip,ip_list):
        print('The ip is present')

Is this method more efficient than the other one? Or is there any other more efficient way to perform this task?

Arjun Balgovind
  • 596
  • 1
  • 5
  • 13
  • As said before, the fastest is to use sets. Other topic about it: https://stackoverflow.com/questions/5993621/fastest-way-to-search-a-list-in-python – Łukasz Szczesiak May 30 '17 at 12:52
  • It's trivial to turn an IPv4 string into a 32-bits int so if I had to create a library like that I'd probably use ints and binary operators internally, which would be quite efficient. Like always you should measure to see if there is really a performance problem first. – polku May 30 '17 at 13:30

3 Answers3

0

This is probably not the best possible solution, but I'd suggest using a set rather than a list. Sets are optimized for checking if any given value is present in the set, so you're replacing your binary search with a single operation. Instead of:

ip_list = list(set(ip_list))

just do:

ip_set = set(ip_list)

and then the other part of your code becomes:

for ip in myIPList:  # myIPList is the list of 200,000 IPs
    if ip in ip_set:
        print('The ip is present')

Edit: and to make things a bit more memory-efficient you can skip creating an intermediate list as well:

ip_set = set()
for ipMasked in ipsubnets: 
    ip_set.update([str(ip) for ip in IPNetwork(ipMasked)])
Błotosmętek
  • 12,717
  • 19
  • 29
0

Okay, So sorting takes O(nlogn), In case of 13,000,000 you end up doing O(13000000log(13000000)). Then You are iterating over 200000 IP and doing binary search O(logn) on that sorted list on 13000000. I sincerely doubt that's best solution. I suggest you use map

from netaddr import IPNetwork, IPAddress
l_ip_address = map(IPAddress, list_of_ip_address)
l_ip_subnet = map(IPNetwork, list_of_subnets)

if any(x in y for x in l_ip_address for y in l_ip_subnet):
    print "FOUND"
Ishan Bhatt
  • 9,287
  • 6
  • 23
  • 44
  • Can you elaborate on what exactly the map does? And how does it improve the complexity if we are this looping over `x in l_ip_address` and `y in l_ip_subnet`? – Arjun Balgovind May 31 '17 at 04:27
  • map creates another list of type IPAddress from list of IP Address string. So it saves you converting string to IPAddress every time in the loop. – Ishan Bhatt May 31 '17 at 06:07
0

Your IP address in in a subnet if N leading bits of that address match N leading bits of one of the N-bit subnets. So, start by making a list of empty sets. Encode each subnet as a 32-bit integer with the trailing bits masked out. For example, 1.2.3.4/23 equals (0x01020304 & 0xfffffe00) equals 0x01020200. Add this number to the 23rd set in the list, ie subnets[23]. Continue for all the subnets.

To see if an IP address is in your subnets, encode the IP address in the same way as a 32-bit number ipaddr and then (something like, untested code)

for N in range( 32, 0, -1)
    mask = ( 0xffffffff >> (32-N) ) << (32-N)
    if (ipaddr & mask) in subnets[N] :
        # have found ipaddr in one of our subnets
        break # or do whatever...
else
    # have not found  ipaddr

Looking up a number in a set at worst O(log N) where N in the number of elements in the set. This code does it at most 32 times for the worst case of an ip address that is not in the sets of subnets. If the majority of the addresses are expected to be present, there's an optimisation to test the sets with the most elemnnts first. That might be

for N in ( 24, 16, 8, 29, 23, 28, 27, 26, 25, 22, 15, 21 ... )

or you could calculate the optimal sequence at runtime.

nigel222
  • 7,582
  • 1
  • 14
  • 22