384

I have two large files (sets of filenames), roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff, but the output should be just the lines and nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?

Niels2000

11 Answers

442

The comm command (short for "common") may be useful; per its man page: "comm - compare two sorted files line by line".

#find lines only in file1
comm -23 file1 file2 

#find lines only in file2
comm -13 file1 file2 

#find lines common to both files
comm -12 file1 file2 

The man page is actually quite readable for this.
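
Note that comm requires sorted input; if your files are not already sorted, process substitution (bash/zsh) keeps it to one line:

comm -23 <(sort file1) <(sort file2)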

JnBrymn
  • Works flawlessly on OSX. – Pikachu Jun 28 '16 at 21:27
  • The requirement for sorted input should perhaps be highlighted. – tripleee Apr 18 '17 at 04:55
  • `comm` also has an option to verify input is sorted, `--check-order` (which it seems to do anyway, but this option will cause it to error instead of continue). But to sort the files, simply do: `comm -23 <(sort file1) <(sort file2)` and so on – michael Nov 07 '17 at 06:00
  • I was comparing a file that was generated in Windows against a file that was generated in Linux and it seemed like `comm` wasn't working at all. It took me a while to figure out that it's about the line endings: even lines that look identical are considered different if they have different line endings. The command `dos2unix` can be used to convert the CRLF line endings to LF only. – ZeroOne Mar 27 '20 at 10:30
  • The caveat "this does not work for files with DOS line endings" would have to be added to more or less every single shell script answer. This is a common FAQ; see https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings – tripleee Sep 25 '21 at 14:53
293

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort them on the fly with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only old lines (i.e. in your case, lines present in file1 but not in file2) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w, etc.) for less strict matching.


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
     --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+" "-" or " ", like diff -u (note that it only outputs differences, it lacks the --- +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things like number each line with %dn.
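
For example, combining the two ideas above, something like this (a sketch, assuming GNU diff) should print each line unique to file1 prefixed with its line number:

diff --new-line-format="" --unchanged-line-format="" \
     --old-line-format="%dn: %L" file1 file2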


The diff method (along with the other suggestions comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line from file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
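
Assuming the script is saved as linesnotin.awk (the name the split example below also uses), a typical invocation is:

gawk -f linesnotin.awk file1 file2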

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension): repeated runs with chunks of file1, reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. split provides this from file1, in chunks of 20,000 lines per invocation.

For users on non-GNU systems there is almost certainly a GNU coreutils package available; on OSX, for example, the Apple Xcode tools provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

mr.spuratic
49

As konsolebox suggested, the poster's grep solution

grep -v -f file2 file1

actually works great (and faster) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000-line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l.

These tests also included the -x switch, which is a necessary part of the solution to ensure total accuracy in cases where file2 contains lines that match part of, but not all of, one or more lines in file1.

So a solution that does not require sorted inputs, is fast, and is flexible (case sensitivity, etc.) is:

grep -F -x -v -f file2 file1

This doesn't work with all versions of grep; for example, it fails on macOS, where a line in file1 will be shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively you can install GNU grep on macOS in order to use this solution.
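
If you want to reproduce the timing comparison on your own data, a rough benchmark along these lines (file names assumed) shows the effect of -F:

time grep -x -v -f file2 file1 | wc -l
time grep -F -x -v -f file2 file1 | wc -l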

pbz
  • Yeah, it works but even with `-F` this doesn't scale well. – Molomby Sep 28 '16 at 04:11
  • this is not that fast, I waited 5 minutes for 2 files of ~500k lines before giving up – cahen Jan 06 '17 at 16:22
  • actually, this way is still slower than the comm way, because it has to handle unsorted files and cannot exploit any ordering; comm takes advantage of sorted input – http8086 May 23 '18 at 08:14
  • @workplaylifecycle You need to add the time for sorting which may be the bottleneck for extremely large `file2`. – rwst May 01 '20 at 07:38
  • However, grep with the `-x` option apparently uses more memory. With a `file2` containing 180M words of 6-10 bytes my process got `Killed` on a 32GB RAM machine... – rwst May 01 '20 at 07:44
  • For anyone wondering: 100K lines is too much in practice for this solution. – Bjinse May 24 '22 at 10:50
26

If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:

cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

Listing excludes.txt twice is the trick: every one of its lines appears at least twice in the combined stream, so uniq --unique drops it, while a line that appears only in includes.txt survives. (Note this also drops any line duplicated within includes.txt itself.)

Test:

seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

# Output:
1
2    

This is also relatively fast, compared to grep.

Ondra Žižka
  • Note -- some implementations will not recognize the `--unique` option. You should be able to use the [standardized POSIX option](https://www.unix.com/man-page/POSIX/1posix/uniq/) for this: `| uniq -u` – AndrewF Apr 10 '19 at 02:49
  • In the example, where did the "2" come from? – Niels2000 Oct 14 '19 at 06:44
  • @Niels2000, `seq 1 1 7` creates numbers from 1, with increment 1, until 7, i.e. 1 2 3 4 5 6 7. And right there is your 2! – Eirik Lygre Apr 10 '20 at 21:20
17

Use combine from the moreutils package, a set-operations utility that supports not, and, or and xor:

combine file1 not file2

i.e. give me the lines that are in file1 but not in file2

OR: give me the lines in file1 minus the lines in file2

Note: combine sorts and finds unique lines in both files before performing any operation, but diff does not. So you might find differences between the output of diff and combine.

So in effect you are saying

Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2

In my experience, it's much faster than other options
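
For reference, the other set operations follow the same pattern (a sketch; see man combine for the exact semantics):

combine file1 and file2   # lines present in both files
combine file1 or file2    # lines present in either file
combine file1 xor file2   # lines present in exactly one file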

GypsyCosmonaut
15

What's the speed of sort and diff?

sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted

Puggan Se
13

This seems quick for me:

comm -1 -3 <(sort file1.txt) <(sort file2.txt) > output.txt

OBX
  • Awesome, but for the target question just `comm file1 file2`, because it looks like sorted lists were provided – storenth Dec 18 '20 at 07:17
  • Since comm has its own idea about what "sorted" means, you would have to suppress errors if both files are ordered the same, but not "sorted" according to comm's ideals. F.e. if both files are output of the same transformation but with different filters. In my case it was a list of all BTRFS subvolumes vs. list of read-only snapshots. Always in same order, but not exactly "sorted" lexically. – AnrDaemon Oct 12 '22 at 09:16
6
$ join -v 1 -t '' file1 file2
line2
line3

The -t '' makes sure that the whole line is compared; otherwise a space in a line would be treated as a field separator.
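
As the comment below notes, join also requires sorted input, so in practice you would combine it with process substitution:

join -v 1 -t '' <(sort file1) <(sort file2)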

Zombo
  • Like `comm`, `join` requires both inputs to be sorted on the field you are performing the join operation on. – tripleee Apr 18 '17 at 04:55
5

You can use Python:

python -c '
# build a set of the lines in file2 for O(1) membership tests
with open("file2") as f:
    lines_to_remove = {line.strip() for line in f}

# print the lines of file1 that are not in file2
with open("file1") as f:
    for line in f:
        stripped = line.strip()
        if stripped not in lines_to_remove:
            print(stripped)
'
HelloGoodbye
2

Using fgrep, or adding the -F option to grep, could help. But for faster calculations you could use Awk.

You could try one of these Awk methods:

http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
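
The classic awk idiom in that vein (a sketch in the spirit of the linked scripts, not necessarily their exact code) reads file2 into an array first, then filters file1:

awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1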

konsolebox
  • +1 This is the only answer which doesn't require inputs to be sorted. While apparently the OP was happy with that requirement, it is an unacceptable constraint in many real-world scenarios. – tripleee Nov 06 '14 at 09:16
2

The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side (-y) format.

diff -y --suppress-common-lines file1.txt file2.txt
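
Getting just the raw lines back out of the side-by-side output takes some post-processing; a rough, fragile sketch (lines unique to file1 end with a < marker, but column width and tabs can break this):

diff -y --suppress-common-lines file1.txt file2.txt | grep ' <$' | sed 's/[[:space:]]*<$//'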

baustin