0

I have two CSV files:

File 1

Id, 1st, 2nd
1, first, row
2, second, row

File 2

Id, 1st, 2nd
1, first, row
2, second, line
3, third, row

I am just starting in python and need to write some code, which can do the diff on these files based on primary columns and in this case first column "Id". Output file should be a delta file which should identify the rows that have changed in the second file:

Output delta file

2, second, line
3, third, row
J. M. Arnold
  • 6,261
  • 3
  • 20
  • 38
tanatan
  • 31
  • 5

2 Answers2

1

I suggest you load both CSV files as Pandas DataFrames, and then you use and outer merge with indicator to know what rows changed in the second file. Then, you use query to get only the rows that changed in the second file, and you drop the indicator column ('_merge').

import pandas as pd


df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")

merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")

For further details on finding differences in Pandas DataFrames, read this other question.

Enrico Gandini
  • 855
  • 5
  • 29
0

I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:

import csv
from io import StringIO

data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)

print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}

Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to only consider the Id column, you can do:

d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = { k : d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}
foglerit
  • 7,792
  • 8
  • 44
  • 64