Method 1
Looks like a job for grep's -v (invert match) flag, combined with -F (treat patterns as fixed strings) and -f (read the patterns from a file):
grep -v -F -f listtocheck uniques
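For instance, with made-up contents (only the filenames listtocheck and uniques come from the command above), the invocation keeps the lines of uniques that match nothing in listtocheck:
printf '%s\n' apple banana > listtocheck              # strings to filter out (invented contents)
printf '%s\n' apple cherry banana date > uniques      # file being checked (invented contents)
grep -v -F -f listtocheck uniques
# prints:
#   cherry
#   date
Note that -F still matches substrings of a line; add -x if only whole-line matches should count.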
Method 2
A variation on Drake Clarris's solution (one that can be extended to checking against several reference files, which grep can't do unless they are first merged) would be:
(
    sort < file_to_check | uniq
    cat reference_file reference_file
) | sort | uniq -u
By doing this, any word that appears only in file_to_check will show up exactly once in the combined output of the parenthesized subshell. Words that appear only in reference_file will be output at least twice, and words present in both files will be output at least three times: once from the first file, plus twice from the two copies of the second file. All that remains is to isolate the words we want, those that appear exactly once, which is precisely what sort | uniq -u does.
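A tiny hypothetical run makes the counting concrete (the file names are the ones above, the contents are invented):
printf '%s\n' alpha beta gamma > file_to_check        # "alpha" is the word we are after
printf '%s\n' beta beta gamma delta > reference_file
(
    sort < file_to_check | uniq           # alpha, beta, gamma: once each
    cat reference_file reference_file     # beta x4, gamma x2, delta x2
) | sort | uniq -u
# prints only:
#   alpha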
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
(deliberately twice, so that every reference word still shows up at least two times) instead of cat reference_file reference_file, in order to produce a smaller intermediate output and put less weight on the final sort.
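Put together, the Method 2 pipeline with this optimization would read as follows (a sketch, using the same file names as above):
(
    sort < file_to_check | uniq
    sort < reference_file | uniq     # first copy of the deduplicated reference
    sort < reference_file | uniq     # second copy, so each reference word still appears twice
) | sort | uniq -u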
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and, in the case of repeated checks with different files, the same sorted reference file could be reused again and again without re-sorting it); therefore:
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
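To illustrate the reuse mentioned above, the reference could be sorted once into a file of its own (reference.sorted is a hypothetical name) and then merged against each new file to check:
sort < reference_file | uniq > reference.sorted      # done once, reused afterwards
# for every new file to check (some_file_to_check is a placeholder name):
sort < some_file_to_check | uniq > .tmp.1
sort --merge .tmp.1 reference.sorted reference.sorted | uniq -u
rm -f .tmp.1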
Optimization III
Finally, in the case of very long runs of identical lines in one file, which may happen with some logging systems for example, it may also be worthwhile to run uniq twice: once up front to get rid of the runs (ahem) and once more after sorting to uniquify the result, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
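Plugged into the merge step of Optimization II, this gives something along these lines (a sketch; only the file with the long runs really needs the extra leading uniq, but it does no harm on the other one):
uniq < file_to_check | sort | uniq > .tmp.1      # collapse runs first, then sort and deduplicate
uniq < reference_file | sort | uniq > .tmp.2
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2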