CSV_1.csv has the structure:
ABC
DEF
GHI
JKL
MNO
PQR
CSV_2.csv has the structure:
XYZ
DEF
ABC
CSV_2.csv is a lot smaller than CSV_1.csv, and a lot of the rows that exist in CSV_2.csv also appear in CSV_1.csv. I want to figure out whether there are rows that exist in CSV_2.csv but not in CSV_1.csv.
These files are not sorted.
The bigger CSV has close to 10 million rows; the smaller one has around 7 million rows.
How would I go about doing this? I tried Python, but taking each row from CSV_2.csv and comparing it with the 10 million rows in CSV_1.csv takes a lot of time.
Here is what I tried in Python:
with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()
with open('update.csv', 'a') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
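The slow part of this attempt is most likely `line not in fileone`: `fileone` is a list, so every lookup scans up to 10 million entries. A minimal sketch of the same idea with a set follows, assuming the lines of the larger file fit in memory. The function name `rows_only_in_second` is hypothetical, line endings are stripped before comparing so a missing final newline does not cause a false mismatch, and the output file is overwritten rather than appended to.

```python
def rows_only_in_second(first_path, second_path, out_path):
    """Write lines of second_path that never occur in first_path.

    Returns the number of lines written. Hypothetical helper, not a
    library function.
    """
    # Load the reference file into a set: membership tests on a set are
    # O(1) on average, whereas `line in some_list` scans the whole list.
    with open(first_path, "r", newline="") as first:
        seen = {line.rstrip("\r\n") for line in first}

    missing = 0
    # Stream the second file so only the reference set is held in memory.
    with open(second_path, "r", newline="") as second, \
            open(out_path, "w", newline="") as out:
        for line in second:
            row = line.rstrip("\r\n")
            if row not in seen:
                out.write(row + "\n")
                missing += 1
    return missing
```

With the file names from the attempt above, this would be called as `rows_only_in_second('old.csv', 'new.csv', 'update.csv')`. Rows are compared as whole lines, so two rows that differ only in quoting or whitespace count as different.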
awk comes to mind. What would the exact code be for awk?