
CSV_1.csv has the structure:

ABC
DEF
GHI
JKL
MNO
PQR

CSV_2.csv has the structure:

XYZ
DEF
ABC

CSV_2.csv is a lot smaller than CSV_1.csv, and many of the rows that exist in CSV_2.csv also appear in CSV_1.csv. I want to figure out whether there are rows that exist in CSV_2.csv but not in CSV_1.csv.

These files are not sorted.

The bigger CSV has close to 10 million rows; the smaller one has around 7 million.

How would I go about doing this? I tried Python, but taking each row from CSV_2.csv and comparing it with the 10 million rows in CSV_1.csv takes a lot of time.

Here is what I tried in python:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'a') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

awk comes to mind. What would the exact code be for awk?
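For the awk part of the question, the standard idiom for "lines in the second file that are not in the first" is a two-file one-liner (sketched below and untested at 10-million-row scale; note it holds every distinct line of CSV_1.csv in memory):

```shell
# NR==FNR is true only while reading the first file: record each of its
# lines as a key in the array `seen`, then skip to the next line.
# For the second file, print only lines that were never seen in the first.
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' CSV_1.csv CSV_2.csv > update.csv
```

With the sample data above, update.csv would contain only XYZ.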

Shoumik
  • Have a look at http://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash, it has a lot of well-researched answers. Pick the one that suits your needs – Inian Apr 24 '17 at 06:07
  • You can go through the following questions: http://stackoverflow.com/questions/5268929/python-comparing-two-csv-files-and-searching-for-similar-items http://stackoverflow.com/questions/11108667/comparing-two-csv-files-and-getting-difference – user2125722 Apr 24 '17 at 06:10
  • @Inian The one you linked does not answer my question. I have also looked at other similar questions here on SO. – Shoumik Apr 24 '17 at 06:12
  • @user2125722 I already tried python. Very, very slow. – Shoumik Apr 24 '17 at 06:12
  • If you have tried python you should share, so that someone might help you with it. – Stephen Rauch Apr 24 '17 at 06:14
  • @StephenRauch added. – Shoumik Apr 24 '17 at 06:19
  • @Inian No, I want the fastest solution, considering the size of both the files. – Shoumik Apr 24 '17 at 07:05
  • @Shoumik: Then suggest removing the `Awk` tag if you are not looking for an answer in it – Inian Apr 24 '17 at 07:06

4 Answers


Yes, your approach is very inefficient. The following should be much faster, using the O(1) average-case lookup time of sets and iterating over the lines of t2 lazily:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = frozenset(t1)

    with open('update.csv', 'a') as outFile:
        for line in t2:
            if line not in fileone:
                outFile.write(line)
juanpa.arrivillaga

To speed up the Python implementation, you should use a data structure that is fast for lookups. Try a set:

Change:

fileone = t1.readlines()

To:

fileone = set(t1.readlines())

This will considerably speed up the line:

if line not in fileone:
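To see why this one-line change matters, here is a quick, illustrative micro-benchmark (the exact timings will vary by machine): membership tests scan a list element by element, while a set hashes the probe once.

```python
import timeit

# 100,000 synthetic lines standing in for the contents of old.csv
lines = [f"row{i}\n" for i in range(100_000)]
as_set = set(lines)

# Worst case for the list: the probe is not present, forcing a full scan.
probe = "missing\n"
list_time = timeit.timeit(lambda: probe in lines, number=100)
set_time = timeit.timeit(lambda: probe in as_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

With ~7 million lookups against a 10-million-line list, that per-lookup difference is the gap between hours and seconds.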
Stephen Rauch

You can use pandas DataFrames: create one from each CSV.

import pandas as pd
# the files have no header row, so supply a column name
df1 = pd.read_csv('CSV_1.csv', header=None, names=['val'])
df2 = pd.read_csv('CSV_2.csv', header=None, names=['val'])
>>> df1
   val
0  ABC
1  DEF
2  GHI
3  JKL
4  MNO
5  PQR
>>> 
>>> df2
   val
0  XYZ
1  DEF
2  ABC
>>> df = pd.merge(df1, df2, how='outer', indicator=True)
>>> df
   val      _merge
0  ABC        both
1  DEF        both
2  GHI   left_only
3  JKL   left_only
4  MNO   left_only
5  PQR   left_only
6  XYZ  right_only
>>> uniqueRowsInCsv2 = df[ df['_merge'] == 'right_only' ]
>>> uniqueRowsInCsv2
   val      _merge
6  XYZ  right_only
>>> 
Nilanjan

You could load the data into sets and use the set difference operation to speed things up:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    old_set = set(t1.readlines())
    new_set = set(t2.readlines())

# values in new_set but not in old_set
differences = new_set.difference(old_set)
with open('update.csv', 'a') as outFile:
    for difference in differences:
        outFile.write(difference)
Abhishek Balaji R