-3

I have a list of ~1,000,000 ip address strings. I want to get the set of these ip addresses that are in three cidrs (each cidr is a string like this: "1.0.0.0/25"). What is the fastest way to do this?

A) Convert the three cidrs into sets containing all ip addresses contained in the cidrs. For each ip address in my list, I check if the ip address is in the wanted ip address set.

B) Convert each cidr into min & max ip address. Convert each ip address into a tuple of ints and check if ip > min and ip < max.

Jonathan Allen Grant
  • 3,408
  • 6
  • 30
  • 53
  • 1
    Can you provide what you have done so far? Your code attempt? – iDrwish Apr 20 '18 at 17:54
  • 1
    Try them both and compare. – chepner Apr 20 '18 at 18:03
  • 1
    Short answer: Store them in a more efficient way. Just sorting them numerically will do -- then you can bisect the list to find the first and last entries within each span and spool between those pointers. – Charles Duffy Apr 20 '18 at 18:07
  • Possible duplicate of (though no answers received upvotes) [How to efficiently check if a given IP Address belong to an IP subnetwork in Python?](https://stackoverflow.com/q/44262437/364696) – ShadowRanger Apr 20 '18 at 19:44

2 Answers2

2

If you have only 3 CIDRs, just write three ad-hoc functions like:

def test_cidr1(ipstr):
    # example for 192.168.128.0/18 (i.e. netmask 255.255.192.0)
    if not ipstr.startswith('192.168.'):
        return False
    ip0, ip1, ip2, ip3 = ipstr.split('.')
    return int(ip2) & 192 == 128
VPfB
  • 14,927
  • 6
  • 41
  • 75
1

If you're on Python 3.3 or higher, a decent solution is to use the ipaddress module. Convert your CIDRs to network objects with ipaddress.ip_network up front, then convert your addresses to address objects (with ipaddress.ip_address if they might be IPv4 or IPv6, or just ipaddress.IPv4Address/ipaddress.IPv6Address directly if they are of known type (skips a layer of wrapping).

You can test for membership relatively cheaply with the in operator, e.g. if you stored your networks in a sequence (e.g. list/tuple) you could do:

for address in map(ipaddress.ip_address, stream_of_string_addresses):
    if any(address in network for network in networks):
        ... got a match ...

There are more efficient solutions (particularly if you're talking about many networks, not just three), but this is straightforward, relatively memory efficient, and leaves you with a useful object (not just the raw address string) for further processing.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
  • This looks to me like it's O(n*m), where n=address count and m=network count. Am I reading wrong? That's pretty unfortunate, insofar as requiring sorted inputs could make this far, far cheaper. – Charles Duffy Apr 20 '18 at 18:08
  • @CharlesDuffy: It's `O(n*m)`, where `n` is number of addresses and `m` is number of CIDRs. That said, `m` in this case is fixed at three, so it's not that critical to optimize further. If `m` was of arbitrarily large size, then yes, you'd want to look at other approaches. – ShadowRanger Apr 20 '18 at 18:09
  • It's the `n`, not the `m`, that's large -- so I'd want to avoid the `O(n)` aspect. If we could bisect to find start and end points for each CIDR and stream between them, there we are -- the only linear cost is that of the actual output. Which is to say, like in most things programming-related, better to think about getting the data structures right rather than getting the code right. :) – Charles Duffy Apr 20 '18 at 18:10
  • @CharlesDuffy: Yes, if you can force sorted inputs, then there are much faster ways to do this. For example, using the `bisect` module to bisect an array of `int` converted addresses with a lower bound of `int(network.network_address)` and an upper bound of `int(network.broadcast_address)` (which is what the `__contains__` check of ipaddress's network base class is checking) would get you down to `O(m * log n)` (converting to `int` would be `O(n)` in the first place though, since `bisect` doesn't allow `key` functions). – ShadowRanger Apr 20 '18 at 18:14