Fastest way of finding differences between two files in unix?

Question

I want to find the differences between two files and then write only those differences to a third file. I have seen different approaches using awk, diff and comm. Are there any more?

e.g. Compare two files line by line and generate the difference in another file

e.g. Copy differences between two files in unix

I need to know which is the fastest way of finding all the differences and listing them in a file for each of the cases below:

Case 1 - file2 = file1 + extra text appended.
Case 2 - file2 and file1 are different.

since this depends on your inputs, it is best to time it yourself — perreal, Aug 05 '13 at 23:51
can you please make your cases more specific, and maybe give some sample code of things you've tried? — asf107, Aug 05 '13 at 23:51
For Case 2 there is `cmp`, which compares two files byte by byte. — micke, Aug 05 '13 at 23:56
You already have a number of alternatives. Use the `time` command to find your answer. — Paulo Almeida, Aug 06 '13 at 00:15
"Differences" is really undefined, and could mean a lot of things. As for your "Are there more?" question: Of course, anyone could write a new program to find differences. — BraveNewCurrency, Aug 06 '13 at 02:58

score 51 · Accepted Answer · answered Aug 05 '13 at 23:54

You could try:

comm -13 <(sort file1) <(sort file2) > file3

or

grep -Fxvf file1 file2 > file3

or

diff file1 file2 | grep '^>' | sed 's/^> //' > file3

or

join -v 2 <(sort file1) <(sort file2) > file3
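
To see what each of these commands actually writes, here is a minimal sketch on two tiny made-up files (the fruit names and the temporary directory are assumptions for illustration, not from the question). Pre-sorted copies stand in for `<(sort ...)` so that it also runs in a plain POSIX `sh`, and the diff line keeps only the lines diff marks with `>`, meaning lines present in file2.

```shell
export LC_ALL=C                       # same collation for sort, comm and join
dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'apple\nblueberry\ncherry\ndate\n' > "$dir/file2"

# Sorted copies replace <(sort file1), which needs bash, ksh or zsh.
sort "$dir/file1" > "$dir/s1"
sort "$dir/file2" > "$dir/s2"

# Lines that appear only in file2.
comm -13 "$dir/s1" "$dir/s2" > "$dir/out_comm"

# Lines of file2 that match no whole line of file1 (fixed strings).
grep -Fxvf "$dir/file1" "$dir/file2" > "$dir/out_grep"

# Lines of file2 whose first field has no partner in file1.
join -v 2 "$dir/s1" "$dir/s2" > "$dir/out_join"

# Lines diff reports as present in file2, with the "> " marker removed.
diff "$dir/file1" "$dir/file2" | sed -n 's/^> //p' > "$dir/out_diff"
```

On this input all four output files contain `blueberry` and `date`. Note that `join` matches on the first whitespace-separated field only, so on lines with several words it can disagree with the other three.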

danmc

Using two large text files where one has an extra paragraph of text near the beginning, I timed all four methods. The grep, diff, and join methods all failed to find the extra paragraph. The diff method needs to grep ">" in addition to "<" to work. I'm not familiar with the grep or join methods. The results: comm: 3.661s, grep: 0.035s, diff: 0.051s, join: 3.811s – Jason Hartley Dec 31 '14 at 16:52
Your answer is wrong. Find what is missing in file1 from file2. Right answer: comm -3 <(sort file1) <(sort file2) | tr -d '\t' – binbjz Mar 27 '19 at 15:55
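
A small sketch of what the `comm -3` variant in the comment above produces, using made-up sample files and pre-sorted copies instead of `<(sort ...)` (both are assumptions for illustration). Unlike `comm -13`, it lists lines unique to either file.

```shell
export LC_ALL=C
dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'apple\nblueberry\ncherry\ndate\n' > "$dir/file2"
sort "$dir/file1" > "$dir/s1"
sort "$dir/file2" > "$dir/s2"

# comm -3 drops the common column; file2-only lines come out tab-indented.
comm -3 "$dir/s1" "$dir/s2" > "$dir/both_sides"

# Deleting tabs merges the two columns into one list.
# Caveat: this also deletes any tabs that are part of the lines themselves.
tr -d '\t' < "$dir/both_sides" > "$dir/file3"
```

Here `file3` ends up with `banana`, `blueberry` and `date`. Once the tabs are stripped there is no longer any indication of which file a line came from.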

score 16 · Answer 2 · pron · 2014-04-29T15:45:09.680

Another option:

sort file1 file2 | uniq -u > file3

If you want to see just the duplicate entries, use the "uniq -d" option:

sort file1 file2 | uniq -d > file3
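
One caveat with the `uniq -u` approach: a line that is repeated inside a single file cancels itself out, even if the other file never contains it. The sketch below shows this on made-up data and one possible workaround, de-duplicating each file with `sort -u` first (the file contents and the intermediate names `u1` and `u2` are assumptions for illustration).

```shell
export LC_ALL=C
dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'a\nb\nb\n' > "$dir/file1"     # "b" occurs twice, and only in file1
printf 'a\nc\n' > "$dir/file2"

# Plain version: the two "b" lines count as duplicates and are dropped.
sort "$dir/file1" "$dir/file2" | uniq -u > "$dir/plain"

# De-duplicate each file first so that only cross-file duplicates cancel.
sort -u "$dir/file1" > "$dir/u1"
sort -u "$dir/file2" > "$dir/u2"
sort "$dir/u1" "$dir/u2" | uniq -u > "$dir/file3"
```

With this input `plain` contains only `c`, while `file3` contains `b` and `c`.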

edited Apr 29 '14 at 15:45

answered Apr 29 '14 at 15:37

pron

I like this answer the best because it is straightforward, intuitive, and doesn't involve some complex command line options/syntax. – wisbucky Aug 28 '19 at 22:15
Note: one distinction is that for a line that is different, this `uniq` solution will print both `file1` and `file2` versions of the line. The `comm` and `grep` solutions will only print the `file2` version. – wisbucky Aug 28 '19 at 22:29

score 1 · Answer 3 · answered Aug 07 '13 at 13:01

You could also use md5 hash sums or similar to determine whether there are any differences at all. Then only compare the files that have different hashes...
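
A rough sketch of that idea, with some assumptions: it uses POSIX `cksum` instead of md5 so that it runs without extra tools (`md5sum` or `sha256sum` could be swapped in where installed), and the helper name `same_checksum` and the sample files are made up for illustration.

```shell
# Hypothetical helper: true when both files have the same checksum and size.
same_checksum() {
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'one\ntwo\n' > "$dir/file1"
printf 'one\ntwo\n' > "$dir/copy"     # identical to file1
printf 'one\n2\n' > "$dir/file2"      # differs on the second line

for other in "$dir/copy" "$dir/file2"; do
    if same_checksum "$dir/file1" "$other"; then
        : > "$other.diff"             # checksums match: assume nothing to list
    else
        # diff exits with 1 when it finds differences; accept only that case.
        diff "$dir/file1" "$other" > "$other.diff" || [ "$?" -eq 1 ]
    fi
done
```

The checksum only gates the expensive step. It pays off mostly when checksums are stored and reused across many comparisons, and a weak checksum such as the CRC used by `cksum` can in principle collide.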

P_M

But is hashing two files faster than comparing two files? – Jason Hartley Dec 31 '14 at 16:17
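
On that question: a checksum has to read both files completely, whereas `cmp` stops at the first byte that differs, so for a one-off check `cmp -s` is a reasonable alternative. A minimal sketch with made-up files (the variable name `result` is just for illustration):

```shell
dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'one\ntwo\n' > "$dir/file1"
printf 'one\n2\n' > "$dir/file2"

# cmp -s prints nothing; exit status 0 means identical, 1 means different.
if cmp -s "$dir/file1" "$dir/file2"; then
    result=identical
else
    result=different
fi
echo "$result"
```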

score 0 · Answer 4 · answered Apr 17 '15 at 08:58

This will work fast:

Case 1 - File2 = File1 + extra text appended.

grep -Fxvf File1.txt File2.txt > File3.txt

File 1: 80 Lines File 2: 100 Lines File 3: 20 Lines
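
For Case 1 specifically, if file2 really is file1 with bytes appended, the appended part can be cut out by byte offset without comparing lines at all. A rough sketch, assuming made-up sample files and that `head -c` is available (it is common, but not guaranteed by POSIX):

```shell
dir=$(mktemp -d)                      # hypothetical scratch directory
printf 'line1\nline2\n' > "$dir/file1"
{ cat "$dir/file1"; printf 'extra1\nextra2\n'; } > "$dir/file2"

# Size of file1 in bytes; the arithmetic expansion strips any padding from wc.
n=$(( $(wc -c < "$dir/file1") ))

# Only trust the shortcut if file2 really starts with the whole of file1.
if head -c "$n" "$dir/file2" | cmp -s - "$dir/file1"; then
    # tail -c +K starts output at byte K, so this is everything after byte n.
    tail -c "+$((n + 1))" "$dir/file2" > "$dir/file3"
else
    echo "file2 is not file1 plus appended text" >&2
fi
```

Here `file3` receives just the two appended lines. When the prefix check fails, a line-based method such as the grep command above is the fallback.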

James Bond 86
