Difference between two files without sorting

Question

I have two files, file1 and file2, where file2 is a subset of file1: some lines of file1 also appear in file2 and some do not, but file2 contains no line that is missing from file1. A file may contain several lines with the same content. I want the difference between them, that is, all lines of file1 that are not in file2.

According to this well-received answer:

diff(1) isn't the answer, comm(1) is.

(For whatever reason)

But as I understand it, comm requires the files to be sorted first. The problem: both files are ordered (not sorted!), and this order must be kept. So what I really want is to iterate over file1 and check for every line whether it is also in file2. If not, write it to file3. If the same content occurs more than once, it should be kept more than once!

Is there any way to do this on the command line?

Cyrus · Answer 1 · 2016-03-01T21:06:01.040

5

Try this with GNU grep:

grep -vFf file2 file1 > file3

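Update (add -x so that only whole lines match; without it, a line of file2 also removes every line of file1 that merely contains it as a substring):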

grep -vxFf file2 file1 > file3
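
The difference between the two commands shows up when a line of file2 is a substring of a longer line in file1. Here is a minimal sketch with made-up sample data (the temporary directory and the file contents are assumptions for illustration, not the asker's real files):

```shell
# Hypothetical sample data, invented for this illustration.
tmp=$(mktemp -d)
printf '%s\n' 'short line with more text' 'other line' 'short line' > "$tmp/file1"
printf '%s\n' 'short line' > "$tmp/file2"

# -F: fixed strings, -f: read the patterns from file2, -v: print non-matching lines.
# Without -x a pattern may match anywhere in a line, so the longer line is dropped too.
substring_result=$(grep -vFf "$tmp/file2" "$tmp/file1")

# With -x a pattern has to match the whole line.
whole_line_result=$(grep -vxFf "$tmp/file2" "$tmp/file1")

printf 'without -x:\n%s\n\nwith -x:\n%s\n' "$substring_result" "$whole_line_result"
rm -r "$tmp"
```

Without -x only "other line" survives; with -x the longer line "short line with more text" is kept as well, and the original order of file1 is preserved in both cases.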

edited Mar 01 '16 at 21:06

answered Mar 01 '16 at 07:13

Cyrus


Looks good after looking at the first lines. I can't really say for sure (file is way too long), but I assume that's the solution. Thanks! – Yanick Nedderhoff Mar 01 '16 at 07:22
Hmm, OK, I just compared the line counts. It should be 5213, but it is 5211. A very small difference, but unfortunately not entirely working. – Yanick Nedderhoff Mar 01 '16 at 07:25
Please upload file1 and file2 somewhere. – Cyrus Mar 01 '16 at 14:50
Perhaps file2 has `short line` and file1 has `short line with more text` and `short line`. – Walter A Mar 01 '16 at 20:02
@WalterA: Thank you. I've updated my answer to avoid this substring problem. – Cyrus Mar 01 '16 at 21:06
Another difference may come from repeated lines in file1 (file1 has the lines a, b, a and file2 has the lines a, b: should the second a be counted as a difference?). – Walter A Mar 02 '16 at 07:30
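
If the answer to that last question is yes, meaning each line in file2 should cancel only one occurrence in file1, grep cannot express it, because it removes every matching line. One possible sketch uses awk with a per-line counter. This is one interpretation of the requirement, not something the question states outright, and the sample data is made up:

```shell
# Hypothetical sample data: "a" occurs twice in file1 but only once in file2.
tmp=$(mktemp -d)
printf '%s\n' a b a c > "$tmp/file1"
printf '%s\n' a b > "$tmp/file2"

# While reading file2 (the first argument): count how often each line occurs.
# While reading file1: skip a line as long as its count is positive, else print it.
# FILENAME == ARGV[1] is used instead of NR == FNR so an empty file2 still works.
multiset_result=$(awk '
    FILENAME == ARGV[1] { count[$0]++; next }
    count[$0] > 0       { count[$0]--; next }
    { print }
' "$tmp/file2" "$tmp/file1")

# For comparison: grep removes every occurrence of a line found in file2.
set_result=$(grep -vxFf "$tmp/file2" "$tmp/file1")

printf 'awk:\n%s\n\ngrep:\n%s\n' "$multiset_result" "$set_result"
rm -r "$tmp"
```

Here awk keeps the second "a" and "c", while grep keeps only "c". Both keep the order of file1, so which one is right depends on how duplicates are supposed to be treated.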

score 0 · Answer 2 · edited May 23 '17 at 12:02


I assume you want to avoid sorting because it would require temporary files. Process substitution makes this possible without them:

diff <(sort file1) <(sort file2)
# or, with -23 to print only the lines unique to file1
comm -23 <(sort file1) <(sort file2)

Edit: Using https://stackoverflow.com/a/4544925/3220113, I found another alternative that does not sort (for text files with short lines):

diff -a --suppress-common-lines -y file2 file1 | sed 's/\s*>.//'

edited May 23 '17 at 12:02


answered Mar 01 '16 at 07:17

Walter A


I don't want to sort because I want to keep the order. – Yanick Nedderhoff Mar 01 '16 at 07:19
I added an alternative without sorting – Walter A Mar 02 '16 at 07:27
