Method 1
Looks like a job for grep's -v (invert match) flag, combined with -F (treat patterns as fixed strings) and -f (read the patterns from a file):
grep -v -F -f listtocheck uniques
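For instance, with made-up contents (only the filenames listtocheck and uniques come from the command above), the invocation keeps the lines of uniques that match nothing in listtocheck:
printf '%s\n' apple banana > listtocheck              # strings to filter out (invented contents)
printf '%s\n' apple cherry banana date > uniques      # file being checked (invented contents)
grep -v -F -f listtocheck uniques
# prints:
#   cherry
#   date
Note that -F still matches substrings of a line; add -x if only whole-line matches should count.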
Method 2
A variation on Drake Clarris's solution (one that can be extended to checking against several reference files, which grep can't do unless they are first merged) would be:
(
    sort < file_to_check | uniq
    cat reference_file reference_file
) | sort | uniq -u
By doing this, any word that appears only in file_to_check will show up exactly once in the combined output of the parenthesized subshell. Words that appear only in reference_file will be output at least twice, and words present in both files will be output at least three times: once from the first file, plus twice from the two copies of the second file. All that remains is to isolate the words we want, those that appear exactly once, which is precisely what sort | uniq -u does.
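A tiny hypothetical run makes the counting concrete (the file names are the ones above, the contents are invented):
printf '%s\n' alpha beta gamma > file_to_check        # "alpha" is the word we are after
printf '%s\n' beta beta gamma delta > reference_file
(
    sort < file_to_check | uniq           # alpha, beta, gamma: once each
    cat reference_file reference_file     # beta x4, gamma x2, delta x2
) | sort | uniq -u
# prints only:
#   alpha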
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
(deliberately twice, so that every reference word still shows up at least two times) instead of cat reference_file reference_file, in order to produce a smaller intermediate output and put less weight on the final sort.
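Put together, the Method 2 pipeline with this optimization would read as follows (a sketch, using the same file names as above):
(
    sort < file_to_check | uniq
    sort < reference_file | uniq     # first copy of the deduplicated reference
    sort < reference_file | uniq     # second copy, so each reference word still appears twice
) | sort | uniq -u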
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and, in the case of repeated checks with different files, the same sorted reference file could be reused again and again without re-sorting it); therefore:
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
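To illustrate the reuse mentioned above, the reference could be sorted once into a file of its own (reference.sorted is a hypothetical name) and then merged against each new file to check:
sort < reference_file | uniq > reference.sorted      # done once, reused afterwards
# for every new file to check (some_file_to_check is a placeholder name):
sort < some_file_to_check | uniq > .tmp.1
sort --merge .tmp.1 reference.sorted reference.sorted | uniq -u
rm -f .tmp.1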
Optimization III
Finally, in the case of very long runs of identical lines in one file, which may happen with some logging systems for example, it may also be worthwhile to run uniq twice: once up front to get rid of the runs (ahem) and once more after sorting to uniquify the result, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
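Plugged into the merge step of Optimization II, this gives something along these lines (a sketch; only the file with the long runs really needs the extra leading uniq, but it does no harm on the other one):
uniq < file_to_check | sort | uniq > .tmp.1      # collapse runs first, then sort and deduplicate
uniq < reference_file | sort | uniq > .tmp.2
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2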