Comparing two lists with a shell script

Question

Suppose I have two lists of numbers in files f1 and f2, one number per line. I want to see how many numbers in the first list are not in the second, and vice versa. Currently I am using grep -f f2 -v f1 and then repeating this, with the files swapped, from a shell script. This is pretty slow (quadratic time hurts). Is there a nicer way of doing this?

What is the format of the file? One number per line? Are the characters supposed to represent integers or floats? Would a python script do? — unutbu, Oct 21 '09 at 11:54
Here is some information on associative arrays in bash: http://stackoverflow.com/questions/688849/associative-arrays-in-shell-scripts — unutbu, Oct 21 '09 at 11:56
One number per line. Integers. I don't actually want a Python script because I am trying to learn more shell scripting. (the original purpose of the shell script was to check my python program was working) — Casebash, Oct 21 '09 at 12:23

score 8 · Accepted Answer · answered Oct 21 '09 at 15:15

I like `comm` for this sort of thing. (The files need to be sorted.)

$ cat f1
1
2
3
$ cat f2
1
4
5
$ comm f1 f2
        1
2
3
    4
    5
$ comm -12 f1 f2
1
$ comm -23 f1 f2
2
3
$ comm -13 f1 f2
4
5
$

Stephen Paul Lesniewski


For numerical input it complained that the files were not in sorted order. --nocheck-order will suppress the complaint. – Casebash Oct 21 '09 at 23:30
Again, a simple grep and wc can be used to find the actual result – Casebash Oct 21 '09 at 23:30
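The counts asked for in the question can be read off by piping `comm` into `wc -l`. Below is a minimal sketch, not the answerer's own script. It assumes the sample data from the answer, a scratch directory from `mktemp`, and the variable names `only_in_f1` and `only_in_f2`, all of which are made up for illustration. `comm` expects lexically sorted input, which is why numerically sorted files trigger the complaint. Sorting with plain `sort` under the same locale as `comm` avoids it without resorting to --nocheck-order.

```shell
# Sample data from the answer, written to a scratch directory (illustrative).
dir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$dir/f1"
printf '%s\n' 1 4 5 > "$dir/f2"

# comm wants lexical order, so use plain sort rather than sort -n.
# LC_ALL=C is an assumption: it makes sort and comm agree on byte order.
# -u also drops duplicate lines within each file.
LC_ALL=C sort -u "$dir/f1" > "$dir/f1.sorted"
LC_ALL=C sort -u "$dir/f2" > "$dir/f2.sorted"

# -23 keeps lines found only in the first file, -13 only in the second.
only_in_f1=$(LC_ALL=C comm -23 "$dir/f1.sorted" "$dir/f2.sorted" | wc -l)
only_in_f2=$(LC_ALL=C comm -13 "$dir/f1.sorted" "$dir/f2.sorted" | wc -l)

echo "only in f1: $only_in_f1"
echo "only in f2: $only_in_f2"
```

Sorting is O(n log n) and `comm` makes a single pass over both files, so this avoids the quadratic behaviour of the repeated grep.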

score 2 · Answer 2 · answered Oct 21 '09 at 11:35

Couldn't you just put each number on its own line and then diff(1) the files? You might need to sort the lists beforehand, though, for that to work properly.

Joey


Will that actually provide counts? – Casebash Oct 21 '09 at 12:30
Not as such, but you can get that with `grep`/`wc` afterwards. This was just a suggestion on how to improve the quadratic runtime. You will get a somewhat readable list of differences (depending on the options to `diff`). You can just count them, then. – Joey Oct 21 '09 at 12:43
Okay, will have to play around with this – Casebash Oct 21 '09 at 13:19
With diff f1 f2, diff will have a < for values in the first but not the second, and a > for values in the second but not the first. A simple grep and wc should provide the desired answer – Casebash Oct 21 '09 at 23:27
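A rough sketch of the diff-and-count idea from this thread, assuming the same sample data as the `comm` answer and made-up variable names. In the default output of `diff f1 f2`, lines prefixed with `<` come from the first file and lines prefixed with `>` from the second, so counting each prefix gives the two numbers. This relies on both files being sorted the same way, free of duplicates, and on diff reporting a minimal set of changes. For large inputs the `-d` option of GNU diff may be worth trying, though that has not been measured here.

```shell
# Sample data, written to a scratch directory (illustrative).
dir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$dir/f1"
printf '%s\n' 1 4 5 > "$dir/f2"

# Sort both files the same way and drop duplicates within each file.
sort -u "$dir/f1" > "$dir/f1.sorted"
sort -u "$dir/f2" > "$dir/f2.sorted"

# diff exits with status 1 when the files differ; "|| true" keeps that
# from aborting a script that runs under "set -e".
changes=$(diff "$dir/f1.sorted" "$dir/f2.sorted" || true)

# "<" marks lines only in the first file, ">" lines only in the second.
# Note that grep -c prints 0 but exits non-zero when nothing matches.
only_in_f1=$(printf '%s\n' "$changes" | grep -c '^<')
only_in_f2=$(printf '%s\n' "$changes" | grep -c '^>')

echo "only in f1: $only_in_f1"
echo "only in f2: $only_in_f2"
```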

score 1 · Answer 3 · answered Oct 21 '09 at 12:08

In the special case where one file is a subset of the other, the following:

cat f1 f2 | sort | uniq -u

would list the lines only in the larger file. And of course piping to wc -l will show the count.

However, that isn't exactly what you described.

This one-liner serves my particular needs often, but I'd love to see a more general solution.
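One possible generalisation of the one-liner, offered as a sketch rather than a tested recipe for large inputs: list the second file twice, so that each of its lines occurs at least twice and `uniq -u` can only print lines that occur in the first file alone. The sample data, scratch directory and variable names below are assumptions for illustration.

```shell
# Sample data, written to a scratch directory (illustrative).
dir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$dir/f1"
printf '%s\n' 1 4 5 > "$dir/f2"

# Drop duplicates within each file first, so uniq -u only reacts to
# repeats across files.
sort -u "$dir/f1" > "$dir/a"
sort -u "$dir/f2" > "$dir/b"

# b appears twice, so none of its lines can be unique in the combined
# stream; what uniq -u prints is exactly the lines found only in a.
only_in_f1=$(cat "$dir/a" "$dir/b" "$dir/b" | sort | uniq -u | wc -l)

# Swap the roles to count the lines found only in b.
only_in_f2=$(cat "$dir/b" "$dir/a" "$dir/a" | sort | uniq -u | wc -l)

echo "only in f1: $only_in_f1"
echo "only in f2: $only_in_f2"
```

This no longer needs one file to be a subset of the other, at the cost of sorting one of the files twice per count.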
