Comparing CSV and print out the different row from both the source and the target with Python

Question

Old csv file

Column1 (Column name: I will not compare the column name. The csv is not in order)

AA101
BB101
CC101
DD101
EE101

New csv file

Column2 (Column name: I will not compare the column name. The csv is not in order)

AA101
CC101
BB101
DD102
EE102

Expect result file:

Different:
Old
DD101 (it is not in the New file)
EE101 (it is not in the New file)
New
DD102 (it is not in the Old file)
DD101 (it is not in the Old file)

I reference this post and create the following code Comparing CSV matching rows with Python

import csv

Source_filename = "E:\Path\Source1.csv"
Target_filename = "E:\Path\Target1.csv"
output_filename = "E:results.csv"

# Load all the entries from Source into a set for quick lookup.
source_ids = set()

with open(Source_filename, 'r') as f:
    big_ip = csv.reader(f)
    for csv_row in big_ip:
        source_ids.add(csv_row[0])

# print source_ids

with open(Source_filename, 'r') as input_file, open(output_filename, 'w') as output_file:
    input_csv = csv.reader(input_file)
    output_csv = csv.writer(output_file)
    for csv_row in input_csv:
        ip = csv_row[0]
        status = "Present" if ip in source_ids else "Not Present"
        output_csv.writerow([ip, status + " in Source.csv"])

The code out put is both same and different from the source. I need out put for different only from both the source and the target

Hi Corralien, Could you provide the sample code to compare them? Thanks — Anson, Jul 20 '21 at 20:11

score 0 · Accepted Answer · answered Jul 20 '21 at 19:47

0

One option is to use Pandas. There are many ways to do this, here is one that will give you a complete list of all records with an "Indicator" column set to "both" (if the record appears in both files), "left_only" (if in old file), or "right_only" (if in new file). More on Pandas merge here:

import pandas as pd

old = pd.read_csv('old_file.csv')
new = pd.read_csv('new_file.csv')
output = old.merge(
    new,
    left_on='old_column_name',
    right_on='new_column_name',
    how='outer',
    indicator=True,
)
output.to_csv('output.csv')

You can also filter the indicator before saving to csv:

output[output['_merge'] != 'both'].to_csv('output.csv')

answered Jul 20 '21 at 19:47

jak123

318
2
12

KeyError Traceback (most recent call last) in – Anson Jul 20 '21 at 20:09
Hi jak123, I get above error message when I run the code. do you run the code? -Thanks – Anson Jul 20 '21 at 20:10
Did you change the left_on/right_on values to your column names? These should match what's in your files. – jak123 Jul 20 '21 at 23:36
Hi Jak123, Your solution works, but the result including both matched and different records. How to filter the match record out and only leave the different records in the output file? – Anson Jul 21 '21 at 13:10
Hi Anson: see the filter line at the end of my answer: ``output[output['_merge'] != 'both'].to_csv('output.csv')`` – jak123 Jul 21 '21 at 15:52
1

Hi Jak123, Thanks. I accept your answer. Great job! – Anson Jul 21 '21 at 16:46

score 0 · Answer 2 · answered Jul 20 '21 at 20:58

With Pandas and pd.merge:

>>> %cat Source1.csv
AA101
BB101
CC101
DD101
EE101

>>> %cat Target1.csv
AA101
CC101
BB101
DD102
EE102

# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd

source = pd.read_csv('Source1.csv', names=['big_ip'], header=None)
target = pd.read_csv('Target1.csv', names=['big_ip'], header=None)

df = pd.merge(source, target, how='outer', indicator=True)

>>> df
  big_ip      _merge
0  AA101        both  # <- present both in source and target
1  BB101        both
2  CC101        both
3  DD101   left_only  # <- present in source only (old)
4  EE101   left_only
5  DD102  right_only  # <- present in target only (new)
6  EE102  right_only

The output can be customized to fit your need.

Comparing CSV and print out the different row from both the source and the target with Python

2 Answers2