My code meant to compare two CSVs in Python and print the differences is printing the similarities - what's wrong with my code?

Question

I currently have two CSVs. I'm trying to scan through each of them, compare the lines, and if there is a line in one that is not in the other, I'd like to print that line to a new CSV. As it stands now this is my code:

# compares the two files
with open('csv1.csv', 'r') as t1, open('csv2.csv', 'r') as t2: 
    fileone = t1.readlines()
    filetwo = t2.readlines()

# scans through the two files and writes differences to new csv
with open('csv3.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

csv1 has 201 rows while csv2 has 156, so I would expect csv3 to have 45 results, however instead it has 156 (the entirety of csv2). I tried switching the logic around to

for line in fileone:
    if line not in filetwo:
         outfile.write(line)

but that just made csv3 empty.

Any help would be appreciated!

also the code I used was sourced from Python : Compare two csv files and print out differences

Here's an example of a line that appears in both but still shows up in csv3

,MAJOR,MAJOR_CODE,
0,Accountancy,ACCT, 
1,Aerospace Engineering,AERO

Can you give an example of a line that is in both but ends up in the third file? — saedx1, Aug 24 '21 at 15:33
Hint: Your first attempt will never show a line from fileone because it only iterates over the lines from filetwo. Your second attempt does just the opposite. — Solomon Slow, Aug 24 '21 at 15:33
@saedx1 the actual data in the CSV is confidential but I'll edit the original question with something in a similar format — gabediq, Aug 24 '21 at 15:36
I think you need to enumerate fileone looking for items that don't exist in filetwo then write to the output file. You then need to enumerate filetwo looking for items that don't exist in fileone - i.e. you need two passes — , Aug 24 '21 at 15:42
@DarkKnight do you think you could provide an example or steer me towards some documentation? I'm still pretty new to Python — gabediq, Aug 24 '21 at 15:44
Do you care about duplicates? If the same line exists twice in file1 but never in file2, do you want it in file3 twice? — Pranav Hosangadi, Aug 24 '21 at 16:01
Good question @PranavHosangadi If duplicates are not wanted then the solution is to create sets from the readlines lists then enumerate those — , Aug 24 '21 at 16:03
@PranavHosangadi duplicates don't matter, they can be in file3 twice — gabediq, Aug 24 '21 at 16:04

score 0 · Answer 1 · answered Aug 24 '21 at 15:56

You need to do two passes.

csv1.csv contents:
A
B
D

csv2.csv contents:
A
B
C

So, in this case, lines A & B are in both files but D is in csv1.csv and not in csv2.csv and C is in csv2.csv but not in csv1.csv.

Therefore, the output file should look like this...

csv3.csv
D
C

Thus...

with open('csv1.csv') as c1, open('csv2.csv') as c2:
    c1lines = c1.readlines()
    c2lines = c2.readlines()
    with open('csv3.csv', 'w') as c3:
        for line in c1lines:
            if not line in c2lines:
                c3.write(line)
        for line in c2lines:
            if not line in c1lines:
                c3.write(line)

score 0 · Answer 2 · answered Aug 24 '21 at 16:06

You can use the set data structure, specifically its symmetric_difference() method.

Reasons to use set:

The symmetric_difference() method does exactly what you want: return the items in either set, but not in both.
Lookup is quick, so even if you don't want to use symmetric_difference(), it'll be much faster than doing line not in list_1.

s_file1 = set(fileone)
s_file2 = set(filetwo)

s_file3 = s_file1.symmetric_difference(s_file2)

Then, you can write the elements of s_file3 to your output file.

Note: sets are unordered, so this will not maintain the order of your lines. Sets also get rid of duplicates, so if a line occurs two times in file1 but zero times in file2, the duplicate will be removed and you'll only get said line one time in your output.

If you care about order and duplicates, and your files are large, it might still be faster to create sets and then use those to check for membership in your loops.

diff_lines = []
for line in fileone:
    if not line in s_file2:
        diff_lines.append(line)

for line in filetwo:
    if not line in s_file1:
        diff_lines.append(line)

And then write diff_lines to your output (or just write to output directly instead of appending to diff_lines).

My code meant to compare two CSVs in Python and print the differences is printing the similarities - what's wrong with my code?

2 Answers2