
My problem: I have two large CSV files, each with millions of lines.

One file contains a backup of a database from my server and looks like:

securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...

Now I have another CSV file containing new codes, with the exact same schema.

I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates the codes randomly, we want to be certain to only update codes that are not already on the server.

I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv

First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it used too many resources, so I thought there had to be a better way.

I then used diff to find only the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.

I believe you can tell diff directly that you only want the differences from the second file, but I didn't understand how, so I grepped the output, knowing that I could cut/remove the unwanted characters later: diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes

But I believe there has to be a better way.

So I'm asking if you have any ideas on how to improve this method.

EDIT:

comm works great so far. But one thing I forgot to mention is that some of the codes on the server have already been scanned.

But new codes are always initialized with isScanned = false, so newCodes.csv would look something like:

securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...

I don't know whether it would be sufficient to use cut -d',' -f1 to reduce the files to just the codes and then use comm.

I tried that, once with grep and once with comm, and got different results. So I'm kind of unsure which one is the correct way ^^
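
Roughly, the cut + comm attempt looked like this (just a sketch; the intermediate file names are placeholders and my exact commands may have differed):

cut -d',' -f1 serverBackupSorted.csv | sort -u > serverCodesOnly   # just the codes already on the server
cut -d',' -f1 newCodesSorted.csv | sort -u > newCodesOnly          # just the new codes
comm -13 serverCodesOnly newCodesOnly                              # codes that appear only in the new file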

NationBoneless
  • Please, post some sample of the other csv file for testing and the expected output. Add them to your question, do not post them as comments or images. Thanks. – James Brown Dec 17 '20 at 19:42
  • Do you only care about checking the new file where the code is `,false`? The correct way will be to use `awk`. Otherwise you end up making multiple passes over each of your million line files. In other words, do you want to skip the common names in other file that are already scanned `,true`? – David C. Rankin Dec 18 '20 at 08:18

3 Answers

4

Yes! A highly underrated tool, comm, is great for this. Examples stolen from here.

Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b

Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b

Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'

As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)

If you do prefer to stick with diff, you can get it to do what you want without the grep:

diff --changed-group-format='%<%>' --unchanged-group-format='' 1.txt 2.txt

You could then alias that diff command to "comp" or something similar to allow you to just:

comp 1.txt 2.txt
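
The alias definition might look like this (a sketch; the name comp is just an example):

alias comp="diff --changed-group-format='%<%>' --unchanged-group-format=''"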

That might be handy if this is a command you are likely to use often in future.

axwr
  • +1 for mentioning `comm`, but should point out that input files need to be sorted beforehand; perhaps add example: `comm -12 <(sort a) <(sort b)` ? – markp-fuso Dec 17 '20 at 20:25
  • @markp-fuso, that's a good point. I didn't mention it as the original poster mentioned they had sorted the files already, but it's better to be explicit. I've added your example. – axwr Dec 17 '20 at 20:48
  • Note that the answer here isn't restricted to bash and as far as I know should work under at least zsh and ksh as well. –  Dec 17 '20 at 21:06
  • @Roadowl great point, it works for me under fish. – axwr Dec 17 '20 at 21:07
0

I would think that sorting the files uses a lot of resources.
When you only want the new lines, you can try grep with the -v option:

grep -vFxf serverBackup.csv newCodes.csv 

or first split serverBackup.csv

# split the server backup into chunks of 10,000 lines (splittedaaaa, splittedaaab, ...)
split -a 4 --lines 10000 serverBackup.csv splitted
# keep a copy of the original new-codes file
cp newCodes.csv newCodes.csv.org
# filter newCodes.csv against each chunk, shrinking it step by step
for f in splitted*; do
   grep -vFxf "${f}" newCodes.csv > smaller
   mv smaller newCodes.csv
done
rm splitted*
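
Note that the question's edit says the isScanned flag can differ between the two files; a full-line match then treats e.g. NAPOIDFNLE,true and NAPOIDFNLE,false as different lines and keeps the code. If you only want to compare the code column, a variant like this should work (a sketch; it prints just the codes, not the full lines, and relies on bash process substitution):

grep -vFxf <(cut -d',' -f1 serverBackup.csv) <(cut -d',' -f1 newCodes.csv)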
Walter A
0

Given:

$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true

$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true

You could use awk:

$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
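
The awk above compares whole lines. If you want to compare on the securityCode alone and ignore the isScanned flag, a variant keyed on the first field should work (a sketch; -F, makes awk split on commas), which with the files above would print only the genuinely new code:

$ awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' f1 f2
SOMETHINGELSE,true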
dawg