5

I have a file I get every day that has 10,000 records in it, 99% of which were in the last day's file. How can I use the macOS command line to remove the lines in the newer file that exist in the previous day's file?

remove_duplicates newfile oldfile

These files look like this:

"First Last"\t"email"\t"phone"\t"9 more columns..."

Note: I tried this awk solution, but it didn't output anything, even though I confirmed there are duplicate lines.

Chuck
  • Sounds like you should persist your files to a database; then you can insert new rows only if they don't already exist. Most common SQL implementations support this type of operation. – Hunter McMillen May 16 '18 at 01:45
  • Ironically, it's going into a database after this, and the database in question, FileMaker, which doesn't support SQL for inserting rows, doesn't support this. So the import takes forever for 10K records, and then I'm deleting all but a few dozen. Trying to speed that up. – Chuck May 16 '18 at 01:51

3 Answers

5

You could likely use grep with the -v (invert-match) and -f (file) options:

grep -v -f oldfile newfile > newstrip

This keeps the lines in newfile that do not match any line of oldfile and saves them to newstrip. Note that grep treats each line of oldfile as a regular expression and matches substrings, so characters like the dots in email addresses can over-match. If you are happy with the results you could easily do afterward:

mv newstrip newfile

This will overwrite newfile with newstrip (removing newstrip).
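Since the records are literal tab-separated fields rather than patterns, adding `-F` (fixed strings) and `-x` (whole-line match) makes the match exact and, with a 10,000-line pattern file, considerably faster. A minimal sketch with made-up sample data (the names and emails are hypothetical, just mimicking the quoted, tab-separated format):

```shell
# Hypothetical sample data in the question's quoted, tab-separated format
printf '"Ann A"\t"a@x.com"\t"555-0001"\n"Bob B"\t"b@x.com"\t"555-0002"\n' > oldfile
printf '"Ann A"\t"a@x.com"\t"555-0001"\n"Bob B"\t"b@x.com"\t"555-0002"\n"Cal C"\t"c@x.com"\t"555-0003"\n' > newfile

# -F: treat each line of oldfile as a literal string, not a regex
# -x: only count a match if the whole line is identical
grep -Fvx -f oldfile newfile > newstrip
cat newstrip   # prints only the "Cal C" line, the one record unique to newfile
```

Both flags are supported by the BSD grep that ships with macOS.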

l'L'l
4

The comm command takes two file arguments and prints three columns: lines unique to the first file, lines unique to the second file, and lines occurring in both files. So if you have two files where one is a copy of the other one plus a few lines, like this:

oldfile:

line1
line2
line3

newfile:

line1
line2
line3
line4
line5

you can use comm as follows:

$ comm -13 oldfile newfile
line4
line5

where -13 stands for "suppress columns 1 and 3", i.e., print only lines unique to the second file.

comm expects its inputs to be sorted and will complain if they aren't (at least the GNU version of comm does), but if your files really are copies of each other plus extra lines in one of them, you can suppress that warning:

comm --nocheck-order -13 oldfile newfile

--nocheck-order exists only in GNU comm, which is part of GNU coreutils (installable via Homebrew, for example).

If the warning about the files being unsorted is a show stopper and the order of the output lines doesn't matter, you could also sort the input files:

comm -13 <(sort oldfile) <(sort newfile)
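Putting the example above into a runnable form (the file contents are the same line1…line5 toy data used earlier):

```shell
# Recreate the toy files from the example above
printf 'line1\nline2\nline3\n' > oldfile
printf 'line1\nline2\nline3\nline4\nline5\n' > newfile

# -13: suppress column 1 (unique to oldfile) and column 3 (common lines),
# leaving only lines unique to newfile
comm -13 oldfile newfile
# prints:
# line4
# line5

# If the files may be unsorted, process substitution (bash/zsh) sorts them on the fly
comm -13 <(sort oldfile) <(sort newfile)
```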
Benjamin W.
  • The macOS version of `comm` does not support `--nocheck-order`. – zneak May 16 '18 at 01:55
  • @zneak I recommend installing GNU coreutils then. According to the macOS comm manpage, it only expects its input files to be sorted, not sure if it'll just go on if they aren't. I'll point out the GNUism. – Benjamin W. May 16 '18 at 01:58
1

In terms of a bash script, a solution I can come up with is:

sort newfile | uniq | cat oldfile oldfile - | sort | uniq -u

Broken down:

  • sort newfile: sort the rows in newfile (necessary for uniq)
  • uniq: keep at most one copy of each identical row
  • cat oldfile oldfile -: output oldfile twice, followed by the piped output of the previous uniq call (- stands for standard input)
  • sort: sort rows, as required for uniq
  • uniq -u: only keep rows that appear exactly once

Since oldfile is written out twice, every row in oldfile will be discarded by uniq -u. You'll be left with rows that appear only in newfile.

Obvious caveats: your file is now sorted and you only have one of each duplicated row.
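The whole pipeline can be sketched with the same line1…line5 toy data, deliberately shuffled to show that the input order doesn't matter:

```shell
# Toy data, deliberately unsorted
printf 'line3\nline1\nline2\n' > oldfile
printf 'line2\nline5\nline1\nline4\nline3\n' > newfile

# Every oldfile line appears at least twice in the concatenation,
# so uniq -u discards it; only lines unique to newfile survive
sort newfile | uniq | cat oldfile oldfile - | sort | uniq -u
# prints:
# line4
# line5
```

As a small simplification, `sort newfile | uniq` can be shortened to `sort -u newfile`.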

zneak