-3

I have 2 CSV files. One with Hostnames and IPs. The second with IP information (Netmask, Cidr, Subnet Name etc..)

I currently have a situation like the following.

for i, ipAddress in CSV1:
       for j, ipNetwork in CSV2:
             if ipAddress in ipNetwork:
                  append ipNetwork['Subnet Name']

Unfortunately, the first csv has 9000 IPs and the second list has 30'000 subnets. This is taking a huge amount of time to iterate through. I know this was an awful way of implementing this but I knew I could always improve.

Can anyone advise how better I can solve this problem? How can I search through and compare elements in each to shorten the runtime of this script?

Here is the sample data

[CSV 1 - Sample IP]

IP Address
144.196.86.89
56.144.25.138
3.16.101.238
123.18.128.50
19.22.2.124
78.88.241.163
144.44.200.20
27.215.172.218
124.90.163.19

[CSV 2 - Sample Subnet Information]

address netmask Company Subnet Name Compartment Type    cidr
10.2.1.0    255.255.255.0   UPS UPS Site 1  Desktop 10.2.1.0/24
10.2.2.0    255.255.255.0   UPS UPS Site 2  Desktop 10.2.2.0/24
10.2.3.0    255.255.255.0   UPS UPS Site 3  Desktop 10.2.3.0/24
10.2.4.0    255.255.255.0   UPS UPS Site 4  Desktop 10.2.4.0/24
10.2.5.0    255.255.255.0   UPS UPS Site 5  Desktop 10.2.5.0/24
10.2.6.0    255.255.255.0   UPS UPS Site 6  Desktop 10.2.6.0/24
10.2.7.0    255.255.255.0   UPS UPS Site 7  Desktop 10.2.7.0/24
10.2.8.0    255.255.255.0   UPS UPS Site 8  Desktop 10.2.8.0/24
10.2.10.0   255.255.254.0   UPS UPS Site 9  Desktop 10.2.10.0/23
10.2.12.0   255.255.255.0   UPS UPS Site 10 Desktop 10.2.12.0/24
10.2.13.0   255.255.255.0   UPS UPS Site 11 Desktop 10.2.13.0/24
smci
  • 32,567
  • 20
  • 113
  • 146
Plisken
  • 423
  • 1
  • 4
  • 20
  • Can you share some data? Please see [mcve]. – jpp Apr 23 '18 at 22:33
  • @jpp I don't see why you downvoted his question. I think it has enough information to warrant a solution or at least ideas – Haris Nadeem Apr 23 '18 at 22:51
  • So one suggestion I would recommend is to cast the IP addresses and subnets (assuming they are all expressed as Strings) to either an integer or to bytes. The reason is that Strings are immutable and string comparison is an expensive operation. You might be able to get some ideas [here](https://stackoverflow.com/questions/10283703/conversion-of-ip-address-to-integer) – Haris Nadeem Apr 23 '18 at 22:55
  • 2
    The main problem is I don't know *why your current method is so slow*. Are you using native Python `open`? `csv` module? `pandas`? How do we know if a solution we offer is the slow one you currently use? – jpp Apr 23 '18 at 22:56
  • @jpp agreed. That is a valid point and one I hadn't thought of. Sorry, I didn't mean to accuse or anything. I just enjoy brainstorming ideas and possible solutions. But I totally see where you are coming from and agree a lot of time could be saved from providing more details – Haris Nadeem Apr 23 '18 at 22:58
  • Yes the OP should post a snippet of reproducible data, but: Clearly **it's slow because the OP has three nested loops**: two levels of nested iteration (i, j) over CSV1 file, and then an iteration over ipNetwork. – smci Apr 23 '18 at 23:25
  • Will make some sample data – Plisken Apr 23 '18 at 23:30
  • As to optimizing the IP address comparison, there are tons of existing SO near-duplicates: **[Python: Efficient way to compare ip addresses in a csv file](https://stackoverflow.com/questions/30838776/python-efficient-way-to-compare-ip-addresses-in-a-csv-file)** using regex, **[Compare IP List to another IP List or IP Range](https://stackoverflow.com/questions/40185170/compare-ip-list-to-another-ip-list-or-ip-range)** using a custom class, [Checking if IP address lies within a given range](https://stackoverflow.com/questions/41627920/checking-if-ip-lies-within-a-given-range) and tons more... – smci Apr 23 '18 at 23:33
  • Sorry, it took so long, does anyone want the sample data? – Plisken Apr 23 '18 at 23:52
  • @jpp As per the tags, I am using pandas. Also the dataframes is Pandas specific. – Plisken Apr 23 '18 at 23:53
  • 2
    @Plisken, So what are `CSV1` / `CSV2` ? Please supply a **[mcve]**. – jpp Apr 23 '18 at 23:54
  • Yes please post reproducible snippets for CSV1,2 before they all blow a gasket... :) Honestly, people here want to help you... it's frustrating when it's not reproducible. – smci Apr 23 '18 at 23:59
  • Where can I give you the samples? – Plisken Apr 24 '18 at 00:00
  • Edit the data into the question above. Just give a few lines, enough to be reproducible. Post the actual data here; don't use images, Dropbox, GDrive or third-party links. – smci Apr 24 '18 at 00:02
  • Updated question. I appreciate the help. – Plisken Apr 24 '18 at 00:06
  • Motivated by this and how hard it is tracking down related canonicals: [How to organize and assign canonicals for “Python/pandas compare IP address/CIDR”?](https://meta.stackoverflow.com/questions/366453/how-to-organize-and-assign-canonicals-for-python-pandas-compare-ip-address-cidr). I also list canonicals there. Essentially what's being asked here is a duplicate of several of those, but they're so badly disorganized it's legitimately near-impossible to find and cite them. – smci Apr 24 '18 at 00:52

2 Answers2

0

1) Clearly it's slow because the OP has three nested loops: two levels of nested iteration (i, j) over CSV1 file, and then an iteration over ipNetwork. You should avoid deep nesting like the plague.

To test membership in CSV1, do a first iteration over CSV1 to read in all the ips, and store as a set:

ips = [line[0] for line in CSV1.readlines()] # or whatever, using csv.reader
ips = set(ips)

In fact recent versions of Python allow you to generate the set directly (this is called a generator expression)

ips = set(line[0] for line in CSV1.readlines())

Now you won't need the j-iteration at all, because you're using a set. You just test membership with if ip in ... So we've eliminated one of your three nested loops. Maybe two.

2) Also to get performance gains when you test if ip in ipNetworks..., you could use regexes and/or a hierarchical ipAddress class, e.g. allow comparing ipAddress('157.55.130') to ipAddress('157.55.*.*). See the many existing duplicates on SO on comparing IPs. You could allow wildcards * and/or xxx. If you use wildcards and merge neighboring addresses, you should be able to compress both CSV1 and ipNetwork lengths below 9K;30K respectively.

See the many existing SO questions, such as: 1. using regex Python: Efficient way to compare ip addresses in a csv file 1. using a custom class Compare IP List to another IP List or IP Range 1. comparing with a range Checking if IP address lies within a given range

smci
  • 32,567
  • 20
  • 113
  • 146
  • Essentially what's being asked here is a duplicate of several existing questions, but they're so badly disorganized it's legitimately near-impossible to find and cite them. So instead of reanswering yet again, I put effort into: [How to organize and assign canonicals for “Python/pandas compare IP address/CIDR”?](https://meta.stackoverflow.com/questions/366453/how-to-organize-and-assign-canonicals-for-python-pandas-compare-ip-address-cidr). I also list a few possible canonicals there. This topic is one hell of a mess and needs cleanup volunteers and coherent tagging. – smci Apr 24 '18 at 00:54
0

There were some great pieces of advice here and it gave me a lot to look at.

What I ended up doing was ordering both lists by IP. Once I was comparing a sorted list of IPs and looking for an IP network it was in in the CIDR list I would set that as a temp variable. A lot of the IPs are consecutive so it is likely the next few would also be in the temp.

I then basically narrow the search range to the index of the found CIDR. So after every iteration, my CIDR list search decreases.

It may not be pretty or the most optimized but it works.

I will include a sample of it once I polish it up a bit.

Plisken
  • 423
  • 1
  • 4
  • 20