Pythonic way to Compare two CSV files to track changes

Question

I have a Python script that generates a CSV file (data parsed from a website). Here is an example of the CSV file:

File1.csv

China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;

File2.csv

China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;

As you can see,

China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv is no longer present in File2.csv
This;Is;A;New;Line;; ==> this line from File2.csv is new (it is not present in File1.csv)

I am looking for a way to compare these two CSV files. One important thing to know is that the order of the lines doesn't matter: they can be anywhere.

What I'd like to have is a script which can tell me:

- One new line: This;Is;A;New;Line;;
- One removed line: China;Beijing;BeiwaiOnline BFSU;;;

And so on!

I've tried the following, without any success:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')

c1 = csv.reader(f1)
c2 = csv.reader(f2)

now = [row for row in c2]
past = [row for row in c1]

for row in now:
    #print row
    lol = past.index(row)
    print lol

f1.close()
f2.close()

_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Any idea of the best way to proceed ? Thank you so much in advance ;)

EDIT:

import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')

c1 = csv.reader(f1)
c2 = csv.reader(f2)

s1 = set(c1)
s2 = set(c2)

lol = s1 - s2
print type(lol)
print lol

This seems to be a good idea but :

Traceback (most recent call last):
  File "compare.py", line 20, in <module>
    s1 = set(c1)
TypeError: unhashable type: 'list'

EDIT 2 (please ignore what is above): with your help, here is the script I'm writing:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv


### COMPARISON THING ###
x=0

fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x,fichier))
        x=x+1

choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file

s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';')) ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE

added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"

c = csv.writer(open("CHANGELOG.csv", "a"),delimiter=";" )
# One changelog row per added line, tagged 'AD'
for item_added in added:
    c.writerow(['AD',item_added])

# One changelog row per removed line, tagged 'RM'
for item_removed in removed:
    c.writerow(['RM',item_removed])

Two kinds of errors:

  File "programcompare.py", line 21, in <genexpr>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte

or

  File "programcompare.py", line 21, in <genexpr>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string

It was working a few minutes ago, but I've changed the CSV files to test with different data and here I am :-)

Sorry, last question !

Try [`difflib`](http://docs.python.org/library/difflib.html). — Li-aung Yip, May 23 '12 at 15:24
If you only care about the lines themselves (but not about their fields), you do not need to actually parse the CSV files. Just compare the lines. — betabandido, May 23 '12 at 15:24

score 7 · Accepted Answer · edited May 23 '17 at 12:20

If your data is not prohibitively large, loading it into a set (or frozenset) will be an easy approach:

s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))

To get the list of entries that were added:

added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.

Similarly, to get the list of entries that were removed:

removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"
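
Putting the pieces above together, here is a minimal Python 3 sketch of the same set-difference idea. The function names `load_rows` and `diff_csv` are made up for illustration, and the files are assumed to be small enough to fit in memory and to use `;` as the delimiter:

```python
import csv


def load_rows(path, delimiter=";"):
    """Return the rows of a CSV file as a frozenset of tuples."""
    # newline="" is the mode the csv documentation recommends for file objects.
    with open(path, newline="") as handle:
        # Each row is a list; tuples are hashable, so they can go into a set.
        return frozenset(tuple(row) for row in csv.reader(handle, delimiter=delimiter))


def diff_csv(past_path, now_path, delimiter=";"):
    """Return (added, removed) rows, ignoring line order and duplicates."""
    past = load_rows(past_path, delimiter)
    now = load_rows(now_path, delimiter)
    added = sorted(now - past)    # in "now" but not in "past"
    removed = sorted(past - now)  # in "past" but not in "now"
    return added, removed
```

Calling `diff_csv('past.csv', 'now.csv')` returns two sorted lists of tuples. Because sets are used, a line that appears twice in one file is treated the same as a line that appears once.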

To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each row as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.

To address this, you'll need to convert the list entries into tuples before adding them to the set. See the previous section of my answer for an example of how this can be done.
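
The list-versus-tuple difference can be seen in isolation with a small Python 3 sketch. The sample lines are invented for illustration; `csv.reader` accepts any iterable of strings, so no file is needed:

```python
import csv

# Three made-up lines; the first and last are identical.
lines = ["a;b", "c;d", "a;b"]
rows = list(csv.reader(lines, delimiter=";"))  # each row is a list

try:
    set(rows)  # lists are mutable, hence unhashable
    lists_hashable = True
except TypeError:
    lists_hashable = False

# Converting each row to a tuple makes it hashable.
unique_rows = set(tuple(row) for row in rows)
```

Here `set(rows)` raises the same `TypeError` as in the question, while the tuple version builds a set containing the two distinct rows.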

The additional errors you're seeing are both due to the content of your CSV files.

_csv.Error: newline inside string

It looks like you have quote characters (") somewhere in your data, which confuses the parser. I'm not familiar enough with the CSV module to tell you exactly what has gone wrong, not without having a peek at your data anyway.

I did however manage to reproduce the error as such:

>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: newline inside string

In this case, it can be fixed by instructing the reader not to do any special processing with quotes (see csv.QUOTE_NONE). Do note that this disables the handling of quoted data, whereby delimiters can appear within a quoted string without the string being split into separate entries.

>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
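
To make the trade-off of `csv.QUOTE_NONE` concrete, here is a small Python 3 sketch. The sample line is invented for illustration and contains a delimiter inside a quoted field:

```python
import csv

# A made-up line where the quoted field itself contains a ';'.
line = ['Italy;"Bari; The British School";Yes']

# Default quoting: the quoted field stays in one piece and the quotes are stripped.
default_rows = list(csv.reader(line, delimiter=";"))

# QUOTE_NONE: quotes are ordinary characters, so the field is split at every ';'.
raw_rows = list(csv.reader(line, delimiter=";", quoting=csv.QUOTE_NONE))
```

With default quoting the row has three fields; with `QUOTE_NONE` it has four, and the quote characters stay in the data. For a line-level comparison of two files this may not matter, as long as both files are read the same way.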

_csv.Error: line contains NULL byte

I'm guessing this might be down to the encoding of your CSV files. See the following questions:

Thank you SO much, you just ROCKS ! It works like a charm ;) I was looking forward for a solution like that for more than three hours, you save my night ;) — Carto_, May 23 '12 at 15:45
@Carto_ See updates in the answer. I'm not that familiar with the CSV module so I'm only guessing here, but hopefully it puts you on the right track to finding a solution. Good luck. — Shawn Chin, May 24 '12 at 07:52
That was it (concerning the quotes thing) and the links seems to have a good solution for me. Thx again ;-) — Carto_, May 24 '12 at 09:09
@Carto_ BTW, for future reference you might want to consider posting a separate question (but search for existing answers first!) if you have follow-on questions. This ensures you get a prompter answer as well as make it more searchable which helps others with similar problems. — Shawn Chin, May 24 '12 at 09:18

score 2 · Answer 2 · answered May 23 '12 at 15:28

Read the csv files line by line into sets. Compare the sets.

>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])
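
The same idea applied to real files could look like the following Python 3 sketch. The helper names `line_set` and `compare_lines` are hypothetical, and blank lines are skipped as an assumption about what counts as a record:

```python
def line_set(path):
    """Return the non-blank lines of a text file as a set of strings."""
    with open(path) as handle:
        # Strip the line ending so the last line compares equal to the others.
        return set(line.rstrip("\r\n") for line in handle if line.strip())


def compare_lines(old_path, new_path):
    """Return (added, removed) lines, ignoring the order of lines."""
    old_lines = line_set(old_path)
    new_lines = line_set(new_path)
    return new_lines - old_lines, old_lines - new_lines
```

Since the lines are compared as plain strings, no CSV parsing takes place, so quoting problems in the data cannot raise a `_csv.Error` here.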

Your solution seems totally perfect! I have only one question: when I try to put the content into a set, it gives me this error: csv_content = set(past) TypeError: unhashable type: 'list'. I've edited my question. — Carto_, May 23 '12 at 15:38
