Find lines from a file which are not present in another file

Question

I have two files (let's say a.txt and b.txt), both of which contain a list of names. I have already run sort on both files.

Now I want to find lines from a.txt which are not present in b.txt.

(I spent a lot of time finding the answer to this question, so I am documenting it here for future reference.)

score 237 · Accepted Answer · edited Mar 27 '23 at 08:24

The command you have to use is not diff but comm:

/usr/bin/comm -23 a.txt b.txt

By default, comm outputs 3 columns: left-only, right-only, both. The -1, -2 and -3 switches suppress these columns.

So, -23 hides the right-only and both columns, showing the lines that appear only in the first (left) file.

If you want to find lines that appear in both, you can use -12, which hides the left-only and right-only columns, leaving you with just the both column.
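
As a minimal sketch of how the three switches combine, here is a run on two small sorted sample files. The names in the files and the scratch directory are made up for illustration, and `mktemp -d` is assumed to be available on your system:

```shell
# Create two small, already-sorted sample files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' Alex Bill Fred > "$tmpdir/a.txt"
printf '%s\n' Bill Carol Fred > "$tmpdir/b.txt"

# -23 suppresses columns 2 and 3, so only lines unique to a.txt remain.
only_in_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")

# -13 suppresses columns 1 and 3, so only lines unique to b.txt remain.
only_in_b=$(comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")

# -12 suppresses columns 1 and 2, so only lines common to both remain.
in_both=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")

printf 'only in a.txt: %s\n' "$only_in_a"
printf 'only in b.txt: %s\n' "$only_in_b"
printf 'in both:\n%s\n' "$in_both"

rm -r "$tmpdir"
```

With this sample data, Alex is the only line unique to a.txt, Carol is the only line unique to b.txt, and Bill and Fred are common to both.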

edited by MUY Belgium · answered Jan 23 '13 at 05:32 by Sudar

I will add that this works only if both files are sorted. (I know the OP mentioned he sorted the files, but many people, me included, read the question title and then jump to the answers) – user247866 Apr 04 '14 at 18:06

@user247866: Fortunately comm is kind enough to tell you if they are not sorted :) – marlar Feb 29 '16 at 10:48

score 45 · Answer 2 · edited Apr 13 '17 at 12:36

The simple answer did not work for me because I didn't realize comm matches line for line, so extra copies of a duplicated line in one file are reported as missing from the other. For example, if file1 contained:

Alex
Bill
Fred

And file2 contained:

Alex
Bill
Bill
Bill
Fred

Then comm -13 file1 file2 would output:

Bill
Bill

In my case, I wanted to know only that every string in file2 existed in file1, regardless of how many times that line occurred in each file.

Solution 1: use the -u (unique) flag to sort:

comm -13 <(sort -u file1) <(sort -u file2)
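
To see the effect of deduplicating first, here is a rough sketch that replays the file1/file2 example above. It writes the `sort -u` output to temporary files instead of using `<(...)`, purely so that it also runs under a plain POSIX sh; the scratch directory and the `.u` file names are my own choices:

```shell
# Recreate the sample files from the example above.
tmpdir=$(mktemp -d)
printf '%s\n' Alex Bill Fred > "$tmpdir/file1"
printf '%s\n' Alex Bill Bill Bill Fred > "$tmpdir/file2"

# Plain comm pairs lines one-to-one, so the two extra "Bill" lines
# in file2 show up as lines unique to file2.
raw=$(comm -13 "$tmpdir/file1" "$tmpdir/file2")

# Deduplicate both inputs first (sort -u), then compare.
sort -u "$tmpdir/file1" > "$tmpdir/file1.u"
sort -u "$tmpdir/file2" > "$tmpdir/file2.u"
deduped=$(comm -13 "$tmpdir/file1.u" "$tmpdir/file2.u")

printf 'without sort -u:\n%s\n' "$raw"
printf 'with sort -u: [%s]\n' "$deduped"

rm -r "$tmpdir"
```

Without `sort -u` the result is the two extra Bill lines; with it the result is empty, because every distinct line of file2 exists in file1.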

Solution 2: (the first "working" answer I found) from unix.stackexchange:

fgrep -v -f file1 file2

Note that if file2 contains duplicate lines that don't exist at all in file1, fgrep will output each of the duplicate lines. Also note that my totally non-scientific tests on a single laptop for a single (fairly large) dataset showed Solution 1 (using comm) to be almost 5 times faster than Solution 2 (using fgrep).
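
One more caveat with Solution 2, as far as I can tell: `fgrep` (equivalent to `grep -F`) matches fixed strings anywhere in a line, so a line in file2 that merely contains a line of file1 as a substring is filtered out as well. A hedged sketch of the difference, using `grep -F` with the standard `-x` (whole-line match) option; the sample names are invented for illustration:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' Alex Bill > "$tmpdir/file1"
printf '%s\n' Alex Bill Billy Zed Zed > "$tmpdir/file2"

# Substring matching: "Billy" contains "Bill", so it is dropped too.
substring_result=$(grep -F -v -f "$tmpdir/file1" "$tmpdir/file2")

# -x requires the whole line to match, so "Billy" is reported.
whole_line_result=$(grep -F -x -v -f "$tmpdir/file1" "$tmpdir/file2")

# Piping through sort -u collapses the repeated "Zed" line.
unique_result=$(grep -F -x -v -f "$tmpdir/file1" "$tmpdir/file2" | sort -u)

printf 'substring match:\n%s\n' "$substring_result"
printf 'whole-line match:\n%s\n' "$whole_line_result"
printf 'whole-line match, unique:\n%s\n' "$unique_result"

rm -r "$tmpdir"
```

Whether you want `-x` depends on your data; for a list of names where one name can be a prefix of another, it is probably what you mean.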

I had my files sorted and passed through uniq. Anyways thanks for the other solutions. — Sudar, Oct 01 '14 at 07:04
The `fgrep` version will be very slow, if you have tens of thousands of lines. — Kai Petzke, Dec 15 '21 at 06:58

score 19 · Answer 3 · answered Jun 19 '16 at 09:30

I am not sure why it has been said that diff should not be used. I would use it to compare the two files and then output only the lines that are in the left file but not in the right one. diff flags such lines with <, so it suffices to grep for that symbol at the beginning of the line:

diff a.txt b.txt | grep '^<'
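
Note that the matching lines still carry diff's "< " prefix. If you want the bare lines, one possible variation (a sketch, assuming diff's default output format and made-up sample data) is to let sed select the lines and strip the marker in one step:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' Alex Bill Fred > "$tmpdir/a.txt"
printf '%s\n' Bill Carol Fred > "$tmpdir/b.txt"

# Keep only lines that diff marks as coming from the left file ("< "),
# and remove that two-character marker while printing them.
only_in_a=$(diff "$tmpdir/a.txt" "$tmpdir/b.txt" | sed -n 's/^< //p')

printf '%s\n' "$only_in_a"

rm -r "$tmpdir"
```

Keep in mind that diff compares the files by position, so I would only expect this to agree with `comm -23` when both files are sorted the same way.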

answered by simonemainardi

You can use `diff --new-line-format= --unchanged-line-format= a.txt b.txt` to suppress the printing of new and unchanged lines. – David Conrad Apr 04 '17 at 05:36
diff worked fine for me. I am on win10, no comm installed. – Radim Cernej Oct 27 '21 at 06:13

score 16 · Answer 4 · answered Jul 21 '17 at 11:30

If the files are not sorted yet, you can use:

comm -23 <(sort a.txt) <(sort b.txt)
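
The `<(...)` syntax is process substitution, which bash, zsh and ksh support but plain POSIX sh does not. If you are stuck with sh, a rough equivalent is to sort into temporary files first. This is only a sketch: the scratch directory and the `.sorted` file names are my own, and it makes no claim about memory use:

```shell
tmpdir=$(mktemp -d)
# Unsorted sample input, invented for illustration.
printf '%s\n' Fred Alex Bill > "$tmpdir/a.txt"
printf '%s\n' Fred Carol Bill > "$tmpdir/b.txt"

# Sort each file into a temporary copy, leaving the originals untouched.
sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"

# Compare the sorted copies: lines only in a.txt.
only_in_a=$(comm -23 "$tmpdir/a.sorted" "$tmpdir/b.sorted")

printf '%s\n' "$only_in_a"

rm -r "$tmpdir"
```

Run sort and comm under the same locale settings, since comm expects its input in the collation order that sort produced.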

answered by Basj

This allocated like 15GB of memory for me for a couple files each < 300 MB... – user541686 Jan 13 '19 at 17:04
