Comparing two csv files and getting difference

Question

I have two csv file I need to compare and then spit out the differnces:

CSV FORMAT:

 Name   Produce   Number
 Adam   Apple     5
 Tom    Orange    4
 Adam   Orange    11

I need to compare the two csv files and then tell me if there is a difference between Adams apples on sheet and sheet 2 and do that for all names and produce numbers. Both CSV files will be formated the same.

Any pointers will be greatly appreciated

You've tagged this with `excel` but mention CSV files. Do you need to work with xlsx or xls files? You might find that `diff` works for want you need, but you haven't really said whether this needs to be done a lot and build into an existing python program. — ChrisP, Jun 19 '12 at 20:19

score 7 · Answer 1 · answered Jul 23 '16 at 12:48

7

I have used csvdiff

$pip install csvdiff
$csvdiff --style=compact col1 a.csv b.csv

Link to package on pypi

I found this link useful

answered Jul 23 '16 at 12:48

Aakash Gupta

716
6
11

score 5 · Answer 2 · edited Jun 19 '12 at 21:05

5

If your CSV files aren't so large they'll bring your machine to its knees if you load them into memory, then you could try something like:

import csv
csv1 = list(csv.DictReader(open('file1.csv')))
csv2 = list(csv.DictReader(open('file2.csv')))
set1 = set(csv1)
set2 = set(csv2)
print set1 - set2 # in 1, not in 2
print set2 - set1 # in 2, not in 1
print set1 & set2 # in both

For large files, you could load them into a SQLite3 database and use SQL queries to do the same, or sort by relevant keys and then do a match-merge.

edited Jun 19 '12 at 21:05

octopusgrabbus

10,555
15
68
131

answered Jun 19 '12 at 20:40

Jon Clements

138,671
33
247
280

1

The dicts in the csv1 list are not hashable so creating set1 will not be possible. This can be avoided by conversion of the dicts to strings with [json.dumps](http://docs.python.org/2/library/json.html) – Henrik K Apr 29 '13 at 09:32
1

@HK_CK okay I am happy for you to add that to answer... just not change it as you suggedted... – Jon Clements Apr 29 '13 at 09:35
4

TypeError: unhashable type: 'dict'. Come on! – 3pitt Sep 20 '17 at 15:15

score 1 · Answer 3 · edited May 23 '17 at 12:18

1

One of the best utilities for comparing two different files is diff.

See Python implementation here: Comparing two .txt files using difflib in Python

edited May 23 '17 at 12:18

Community

1
1

answered Jun 19 '12 at 20:21

SomeKittens

38,868
19
114
143

Hugh Bothwell · Answer 4 · 2012-06-19T21:02:46.450

import csv

def load_csv_to_dict(fname, get_key, get_data):
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        incsv.next()  # skip header
        return {get_key(row):get_data(row) for row in incsv}

def main():
    key = lambda r: tuple(r[0:2])
    data = lambda r: int(r[2])
    f1 = load_csv_to_dict('file1.csv', key, data)
    f2 = load_csv_to_dict('file2.csv', key, data)

    f1keys = set(f1.iterkeys())
    f2keys = set(f2.iterkeys())

    print("Keys in file1 but not file2:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f1keys-f2keys)))

    print("Keys in file2 but not file1:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f2keys-f1keys)))

    print("Differing values:")
    for k in (f1keys & f2keys):
        a,b = f1[k], f2[k]
        if a != b:
            print("{}:{} {} <> {}".format(k[0],k[1], a, b))

if __name__=="__main__":
    main()

When I try this in IDLE I get this: http://pastebin.com/6U035ERr (Using pastebin so you can see the whole error message with formatting) — Jeremy Pridemore, Jun 19 '12 at 20:48

score 1 · Answer 5 · answered Jun 20 '12 at 13:00

If you want to use Python's csv module along with a function generator, you can use nested looping and compare large .csv files. The example below compares each row using a cursory comparision:

import csv

def csv_lazy_get(csvfile):
    with open(csvfile) as f:
        r = csv.reader(f)
        for row in r:
            yield row

def csv_cmp_lazy(csvfile1, csvfile2):
    gen_2 = csv_lazy_get(csvfile2)

    for row_1 in csv_lazy_get(csvfile1):
        row_2 = gen_2.next()

        print("row_1: ", row_1)
        print("row_2: ", row_2)

        if row_2 == row_1:
            print("row_1 is equal to row_2.")
        else:
            print("row_1 is not equal to row_2.")

    gen_2.close()

score 0 · Answer 6 · answered Jun 19 '12 at 20:26

Here a start that does not use difflib. It is really just a point to build from because maybe Adam and apples appear twice on the sheet; can you ensure that is not the case? Should the apples be summed, or is that an error?

import csv
fsock = open('sheet.csv','rU')
rdr = csv.reader(fsock)
sheet1 = {}
for row in rdr:
    name, produce, amount = row
    sheet1[(name, produce)] = int(amount) # always an integer?
fsock.close()
# repeat the above for the second sheet, then compare

You get the idea?

Comparing two csv files and getting difference

6 Answers6

Linked