My problem: I have two large CSV files, each with millions of lines.
One file contains a backup of a database from my server and looks like this:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates the codes randomly, we want to be certain that we only update codes that are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv
and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv
but the process got killed because it used too many resources, so I thought there had to be a better way.
I then used diff to find only the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you can tell diff directly that you only want the lines from the second file, but I didn't understand how. So I grepped the output, knowing that I could cut the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
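For the "remove unwanted characters later" step, here is a minimal sketch that filters and strips the marker in one pass. It assumes diff's default output format, where lines found only in the second file start with "> ". The tiny sample files and the temporary directory are made up for illustration; only the file names come from the commands above.

```shell
# Throwaway directory with tiny, already sorted sample files (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'NALEJNSIDO,false' 'NALSKIFKEA,false' 'NAPOIDFNLE,true' > serverBackupSorted.csv
printf '%s\n' 'ALBSIBFOEA,false' 'NALEJNSIDO,false' 'NPIAEBNSIE,false' > newCodesSorted.csv

# In diff's default output, lines that exist only in the second file start with "> ".
# sed -n plus the p flag prints only the lines where the substitution happened,
# so this keeps those lines and removes the "> " marker at the same time.
diff serverBackupSorted.csv newCodesSorted.csv | sed -n 's/^> //p' > greppedCodes

cat greppedCodes
```

With this sample data, greppedCodes ends up containing ALBSIBFOEA,false and NPIAEBNSIE,false. Anchoring the pattern with ^ also avoids matching a ">" that happens to appear in the middle of a line.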
But I believe there has to be a better way.
So I'm asking: do you have any ideas on how to improve this method?
EDIT:
comm works great so far. But one thing I forgot to mention is that some of the codes on the server have already been scanned.
New codes, however, are always initialized with isScanned = false. So newCodes.csv would look something like this:
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce both files to just the codes and then use comm.
I tried that, once with grep and once with comm, and got different results. So I'm kind of unsure which one is the correct way ^^
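To make the cut-then-comm idea concrete, here is a minimal sketch using the sample rows from this question. The output file names (serverCodes.txt, newCodesOnly.txt, codesToUpdate.txt) are hypothetical, and forcing LC_ALL=C is an assumption added so that sort and comm agree on the ordering. Comparing whole lines would treat NAPOIDFNLE,true and NAPOIDFNLE,false as different lines, which is why the sketch compares only the first column.

```shell
# Throwaway directory with the sample rows from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' 'NALSKIFKEA,false' \
  'NAPOIDFNLE,true' > serverBackup.csv
printf '%s\n' 'securityCode,isScanned' 'ALBSIBFOEA,false' 'OUVOENJBSD,false' \
  'NAPOIDFNLE,false' 'NALEJNSIDO,false' 'NPIAEBNSIE,false' > newCodes.csv

# Keep only the code column, then sort and de-duplicate.
# LC_ALL=C sorts by byte value, the same order comm expects under LC_ALL=C.
cut -d',' -f1 serverBackup.csv | LC_ALL=C sort -u > serverCodes.txt
cut -d',' -f1 newCodes.csv | LC_ALL=C sort -u > newCodesOnly.txt

# comm -13 suppresses column 1 (lines only in the first file) and
# column 3 (lines in both), leaving lines that are only in the second file.
# The header "securityCode" is in both files, so it drops out as well.
LC_ALL=C comm -13 serverCodes.txt newCodesOnly.txt > codesToUpdate.txt

cat codesToUpdate.txt
```

With this sample data, codesToUpdate.txt contains ALBSIBFOEA, NPIAEBNSIE and OUVOENJBSD. NAPOIDFNLE is left out even though its isScanned value differs between the two files.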