
CSV_1.csv has the structure:

ABC
DEF
GHI
JKL
MNO
PQR

CSV_2.csv has the structure:

XYZ
DEF
ABC

CSV_2.csv is a lot smaller than CSV_1.csv, and many of the rows that exist in CSV_2.csv also appear in CSV_1.csv. I want to figure out whether there are rows that exist in CSV_2.csv but not in CSV_1.csv.

These files are not sorted.

The bigger CSV has close to 10 million rows; the smaller one has around 7 million.

How would I go about doing this? I tried Python, but taking each row from CSV_2.csv and comparing it with the 10 million rows in CSV_1.csv takes a lot of time.

Here is what I tried in python:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'a') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

awk comes to mind. What would the exact code be for awk?
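For the awk part of the question, the standard idiom for "lines in the second file that are not in the first" is a two-file one-liner (sketched below and untested at 10-million-row scale; note it holds every distinct line of CSV_1.csv in memory):

```shell
# NR==FNR is true only while reading the first file: record each of its
# lines as a key in the array `seen`, then skip to the next line.
# For the second file, print only lines that were never seen in the first.
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' CSV_1.csv CSV_2.csv > update.csv
```

With the sample data above, update.csv would contain only XYZ.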

Shoumik
  • Have a look at http://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash, it has a lot of well-researched answers. Pick the one that suits your needs – Inian Apr 24 '17 at 06:07
  • You can go through the following questions: http://stackoverflow.com/questions/5268929/python-comparing-two-csv-files-and-searching-for-similar-items http://stackoverflow.com/questions/11108667/comparing-two-csv-files-and-getting-difference – user2125722 Apr 24 '17 at 06:10
  • @Inian The one you linked does not answer my question. I have also looked at other similar questions here on SO. – Shoumik Apr 24 '17 at 06:12
  • @user2125722 I already tried python. Very, very slow. – Shoumik Apr 24 '17 at 06:12
  • If you have tried python you should share, so that someone might help you with it. – Stephen Rauch Apr 24 '17 at 06:14
  • @StephenRauch added. – Shoumik Apr 24 '17 at 06:19
  • @Inian No, I want the fastest solution, considering the size of both the files. – Shoumik Apr 24 '17 at 07:05
  • @Shoumik: Then suggest removing the `Awk` tag if you are not looking for an answer in it – Inian Apr 24 '17 at 07:06

4 Answers


Yes, your approach is very inefficient. The following should be much faster, using the O(1) average-case lookup time of sets and iterating over the lines of t2 lazily:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = frozenset(t1)

    with open('update.csv', 'a') as outFile:
        for line in t2:
            if line not in fileone:
                outFile.write(line)
juanpa.arrivillaga

To speed up the Python implementation, you should use a data structure that is fast for lookups. Try a set:

Change:

fileone = t1.readlines()

To:

fileone = set(t1.readlines())

This will considerably speed up the line:

if line not in fileone:
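To see why this one-line change matters, here is a quick, illustrative micro-benchmark (the exact timings will vary by machine): membership tests scan a list element by element, while a set hashes the probe once.

```python
import timeit

# 100,000 synthetic lines standing in for the contents of old.csv
lines = [f"row{i}\n" for i in range(100_000)]
as_set = set(lines)

# Worst case for the list: the probe is not present, forcing a full scan.
probe = "missing\n"
list_time = timeit.timeit(lambda: probe in lines, number=100)
set_time = timeit.timeit(lambda: probe in as_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

With ~7 million lookups against a 10-million-line list, that per-lookup difference is the gap between hours and seconds.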
Stephen Rauch

You can use pandas DataFrames: create one from each CSV.

import pandas as pd
# the files have no header row, so supply a column name
df1 = pd.read_csv('CSV_1.csv', header=None, names=['val'])
df2 = pd.read_csv('CSV_2.csv', header=None, names=['val'])
>>> df1
   val
0  ABC
1  DEF
2  GHI
3  JKL
4  MNO
5  PQR
>>> 
>>> df2
   val
0  XYZ
1  DEF
2  ABC
>>> df = pd.merge(df1, df2, how='outer', indicator=True)
>>> df
   val      _merge
0  ABC        both
1  DEF        both
2  GHI   left_only
3  JKL   left_only
4  MNO   left_only
5  PQR   left_only
6  XYZ  right_only
>>> uniqueRowsInCsv2 = df[ df['_merge'] == 'right_only' ]
>>> uniqueRowsInCsv2
   val      _merge
6  XYZ  right_only
>>> 
Nilanjan

You could load the data into sets and use the set difference operation to speed things up:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    old_set = set(t1.readlines())
    new_set = set(t2.readlines())

# values in new_set but not in old_set
differences = new_set.difference(old_set)
with open('update.csv', 'a') as outFile:
    for difference in differences:
        outFile.write(difference)
Abhishek Balaji R