
I have two CSV files. The first looks like this:

[screenshot of the first CSV file]

The second contains a list of IPs:

139.15.250.196
139.15.5.176

I'd like to check whether any given IP from the first file is in the second file. This seems to work (please correct me or provide hints if my code is broken), but the issue is that the first file contains many duplicate values, e.g. 10.0.0.1 may appear x times, and I was not able to find a way to remove the duplicates. Could you please assist or guide me?

import csv

filename = 'ip2.csv'
with open(filename) as f:
    reader = csv.reader(f)
    ip = []
    for row in reader:
        ip.append(row[0])


filename = 'bonk_https.csv'
with open(filename) as f:
    reader = csv.reader(f)
    ip_ext = []
    for row in reader:
        ip_ext.append(row[0])
        for a in ip:
            if a in ip_ext:
                print(a)
postFix
  • Have you looked at the pandas library? You could import the CSVs into pandas with the read_csv command, then deduplicate the list in pandas, then execute an inner join in pandas with the merge command to get the list of matching items. – ChrisG Dec 10 '18 at 20:45
  • delete duplicates in Pandas: https://chrisalbon.com/python/data_wrangling/pandas_delete_duplicates/ – ChrisG Dec 10 '18 at 20:47
  • merge/join in Pandas: https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/ – ChrisG Dec 10 '18 at 20:49
  • Why don't you create a [set](https://docs.python.org/2/library/sets.html) of IPs instead of a list? – Aurora Wang Dec 10 '18 at 20:50
  • Your code clearly isn't what you're running; it'll die immediately with a `NameError` (because `reader` isn't defined). Can you post a [MCVE] that can actually run? – ShadowRanger Dec 10 '18 at 21:15
  • @ShadowRanger my apologies, it was a copy-paste/wrong-tab issue. Corrected it. – postFix Dec 11 '18 at 06:01
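A sketch of the pandas route suggested in the comments. The sample rows written below are made up for illustration; in the real case the files are the existing `ip2.csv` and `bonk_https.csv`, and the IP is assumed to be in the first column with no header row:

```python
import pandas as pd

# Tiny made-up sample files standing in for the real ip2.csv / bonk_https.csv
with open('ip2.csv', 'w') as f:
    f.write('10.0.0.1,foo\n10.0.0.1,bar\n139.15.250.196,baz\n')
with open('bonk_https.csv', 'w') as f:
    f.write('139.15.250.196\n139.15.5.176\n')

# Read only the first column (the IP) of each file; no header row assumed
ips = pd.read_csv('ip2.csv', header=None, usecols=[0])
ips_ext = pd.read_csv('bonk_https.csv', header=None, usecols=[0])

# drop_duplicates removes the repeated 10.0.0.1; the inner merge keeps
# only the IPs that appear in both files
matches = ips.drop_duplicates().merge(ips_ext.drop_duplicates(), on=0)
print(matches[0].tolist())
```

With the sample rows above this prints `['139.15.250.196']`, the duplicate `10.0.0.1` having been collapsed before the join.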

2 Answers


You can turn any list into a set with `set(my_list)`. A set holds only one of each item, and membership can be tested with `member in my_set`, just like with a list. So just convert your `ip` list to a set.

import csv

with open(filename) as f:
    reader = csv.reader(f)  # reader was missing in the original snippet
    ip_ext = []
    for row in reader:
        ip_ext.append(row[0])

for a in set(ip):  # set(ip) contains each IP only once
    if a in ip_ext:  # you don't need a set here unless ip_ext also has duplicates
        print(a)

Alternatively, `break` or `continue` as soon as you have found your entry.

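A minimal sketch of the `continue` idea, using hypothetical in-memory lists in place of the data read from the two CSV files:

```python
# Hypothetical data; in the question these lists come from the CSV files
ip = ['10.0.0.1', '139.15.250.196', '10.0.0.1']  # contains duplicates
ip_ext = ['139.15.250.196', '139.15.5.176']

matches = []
seen = set()
for a in ip:
    if a in seen:
        continue          # duplicate: this IP was already checked, skip it
    seen.add(a)
    if a in ip_ext:
        matches.append(a)
print(matches)            # each matching IP is reported only once
```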
  • Thank you but with your code I'm still getting duplicates :( – postFix Dec 11 '18 at 05:59
  • Please give us some example data and your code. I currently can't see how you can get duplicates if you compare each member of a set (which no longer has duplicates) exactly once with the ip_ext list that you made, unless ip_ext itself also has duplicates. – not_a_bot_no_really_82353 Dec 12 '18 at 23:00
  • To be sure I updated my code. Please try it again. And please tell us more about your data. – not_a_bot_no_really_82353 Dec 12 '18 at 23:01
  • Thank you. In fact it works :) The second file contains duplicates but that's ok. Please see the EDIT section of my question. I hope you can help me with it ! – postFix Dec 15 '18 at 13:16

I suggest that you normalize all the IPs:

with open(...) as f:
    # a set comprehension of _normalized_ IPs; this strips excess leading zeros
    my_ips = {'.'.join('%d' % int(n) for n in t)
              for t in [x.split(',')[0].split('.') for x in f]}

Next, you check each normalized IP from the second file against the IPs contained in the normalized set. Note that, differently from the other answers, here you have a single loop, and that checking whether an item is a member of a set, `x in my_ips`, is a highly optimized operation:

with open(...) as f:
    for line in f:
        ip = '.'.join('%d'%int(n) for n in line.split('.'))
        if ip in my_ips:
            ...
        else:
            ...
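A quick sanity check of the normalization expression used above, with made-up values:

```python
# The join/int round-trip strips excess leading zeros from each octet
normalize = lambda s: '.'.join('%d' % int(n) for n in s.split('.'))

print(normalize('010.000.000.001'))  # -> 10.0.0.1
print(normalize('139.15.250.196'))   # already normal form, unchanged
```

This is why both files must go through the same normalization: `'010.000.000.001'` and `'10.0.0.1'` only compare equal after it.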
gboffi