The csv comparison is excluding duplicates in python

Question

I have two csv files and i am using one csv to search all records from another csv and update its status. I have two tables and looking for IP and PROTOCOL from search table in input.csv. If it is there, then the EXISTS column is updated to 'No'. I am stuck where both the protocol and ip are same for two or more records, but they have different port. It is updating only one record.

import csv

IP, EXISTS, PROTOCOL = 'IP', 'Exists', 'Protocol'  # Field names referenced.

# Read entire input file into a list.
with open('input.csv', 'r', newline='') as inp:
    reader = csv.DictReader(inp)
    inputs = list(reader)

# Update input rows that match data in search.csv file.
with open('search.csv', 'r', newline='') as sea:
    sea_reader = csv.DictReader(sea)
    for row in sea_reader:
        protocol, ip = row[PROTOCOL], row[IP]
        for input_ in inputs:
            if input_[PROTOCOL] == protocol and input_[IP] == ip:  # Match?
                input_[EXISTS] = 'No'
                break

# Write updated input.csv data out into a file.
with open('input_updated.csv', 'w', newline='') as outp:
    fieldnames = inputs[0]
    writer = csv.DictWriter(outp, fieldnames)
    writer.writeheader()
    for input_ in inputs:
        writer.writerow(input_)

print('done')

Input.csv

Name	IP	Protocol	Port
l1	192.132.16.02	HTTP	80
l2	192.132.16.03	SSL	8443
l3	192.132.16.03	SSL	443
l4	192.132.16.04	SSL	443

search.csv

No	Protocol	IP	Port
1	HTTP	192.132.16.02	80
2	SSL	192.132.16.03	443
3	SSL	192.132.16.03	8443

Here in search.csv there are two records which has same protocol and ip but different port. In the result it is just including one record and not both. I tried adding port in the condition too but it does not work

I expect result to be exactly like input.csv and in 'Exists' column the value for first three rows will be 'No' — Fatilearns, Aug 31 '21 at 06:46

score 0 · Answer 1 · answered Aug 31 '21 at 07:41

You can use pandas to read the csv files. Create indexes on both IP, Protocol for both files. Iterate the search_df and insert value into the Exists column as True if index of search_df is found in input_df. Reset the index and drop nan values.


import pandas as pd
input_df = pd.read_csv("input.csv")
search_df = pd.read_csv("search.csv")
input_df = input_df.set_index(["IP","Protocol"])
search_df = search_df.set_index(["IP","Protocol"])
for index, _ in search_df.iterrows():
    input_df.loc[index,"Exists"] = True
input_df = input_df.reset_index()
input_df = input_df.dropna()

score 0 · Accepted Answer · answered Aug 31 '21 at 08:58

The only issue in your code is that you break out of the loop as soon as you find a match, so you will never find further matches. Removing the break will make it work, although it will be slower of course.

One thing is unclear to me: is the port number relevant? In your example the second input line would be matched even if the third search line didn't exist. So it would probably be a good idea to include the port number in the match test:

with open('search.csv', 'r', newline='') as sea:
    sea_reader = csv.DictReader(sea)
    for row in sea_reader:
        protocol, ip, port = row[PROTOCOL], row[IP], row[PORT]
        for input_ in inputs:
            if input_[PROTOCOL] == protocol and input_[IP] == ip and input_[PORT] == port:  # Match?
                input_[EXISTS] = 'No'
                break

In this case the break can remain, cause you will have only exact matches (i.e. input #2 with search #3 and input #3 with search #2)

The csv comparison is excluding duplicates in python

2 Answers2