Finding diffs between two CSV files with primary column

Question

I have two CSV files:

File 1

Id, 1st, 2nd
1, first, row
2, second, row

File 2

Id, 1st, 2nd
1, first, row
2, second, line
3, third, row

I am just starting in python and need to write some code, which can do the diff on these files based on primary columns and in this case first column "Id". Output file should be a delta file which should identify the rows that have changed in the second file:

Output delta file

2, second, line
3, third, row

score 1 · Accepted Answer · answered Feb 12 '21 at 11:28

I suggest you load both CSV files as Pandas DataFrames, and then you use and outer merge with indicator to know what rows changed in the second file. Then, you use query to get only the rows that changed in the second file, and you drop the indicator column ('_merge').

import pandas as pd


df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")

merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")

For further details on finding differences in Pandas DataFrames, read this other question.

Thank you Enrico. I will use this approach as it seems to provide more options on merging. — tanatan, Feb 13 '21 at 05:24

score 0 · Answer 2 · answered Feb 12 '21 at 12:18

I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:

import csv
from io import StringIO

data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)

print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}

Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to only consider the Id column, you can do:

d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = { k : d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}

Thank you forlerit. New to Python and useful to know different ways of doing things — tanatan, Feb 13 '21 at 05:23

Finding diffs between two CSV files with primary column

2 Answers2