Compare 2 similar files and only output the differences, preserving the order in which they occur?

Question

I'm hoping someone can help me get my head around this.

I have 2 files, one is 325 lines long, one is 361 lines long.

The bulk of these files is identical content but the 2nd one has random extra lines inserted. I am only interested in the extra lines, and I need to preserve the order in which they occur in the file.

The files contain a repeating paragraph of approximately 31 lines - I know the first and last line of this paragraph, and have no problems with dropping the entire paragraph, but can't work out how.

i.e. File1

The quick brown
fox jumped 
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog

i.e. File2

The quick brown
fox jumped
over the
lazy dog
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
The quick brown
fox jumped
over the
lazy dog
djakdjhgmv
asdjkljkgfyiyi
The quick brown
fox jumped
over the
lazy dog
jghytpuptou

I need to output only the extra lines in this order:

sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
djakdjhgmv
asdjkljkgfyiyi
jghytpuptou

Any help or advice would be gratefully received, I am not a *nix person unfortunately :( I tried a few diff expressions and comm expressions, but can't get what I need.

score 3 · Answer 1 · edited Dec 20 '11 at 18:37

Try this magic command:

diff file1.txt file2.txt | sed -n 's/^> \(.*\)/\1/p'

diff file1.txt file2.txt should output something like

2c2
< fox jumped 
---
> fox jumped
4a5,7
> sadhasdgh
> qyyutrytkdaslksad
> utyiuiytiuyo
8a12,13
> djakdjhgmv
> asdjkljkgfyiyi
12a18
> jghytpuptou

sed -n 's/^> \(.*\)/\1/p' should find the lines starting with > and output those lines without the >. If this doesn't work for you, a possible reason is that diff produces a different output format on your system.
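As a minimal sketch, the pipeline above can be wrapped in a shell function and tried on two small files. The function name extra_lines and the sample files a.txt and b.txt are made up for illustration, and the sketch assumes diff emits its default format with < and > markers.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that diff reports as present only in the second file.
# Assumes the default ("normal") diff output, where such lines are prefixed with "> ".
extra_lines() {
    diff "$1" "$2" | sed -n 's/^> \(.*\)/\1/p'
}

# Made-up sample input in a scratch directory.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' 'lazy dog' 'The quick brown' > "$tmpdir/a.txt"
printf '%s\n' 'lazy dog' 'sadhasdgh' 'The quick brown' 'jghytpuptou' > "$tmpdir/b.txt"

extra_lines "$tmpdir/a.txt" "$tmpdir/b.txt"
```

On this sample the function should print sadhasdgh and jghytpuptou, in that order. Note that in the question's File1 the second line ends with a trailing space, which is why diff reports the 2c2 change shown above and why "fox jumped" would also be printed; adding -b to diff makes it ignore that kind of whitespace difference.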

answered Dec 20 '11 at 17:46 by alexander · edited Dec 20 '11 at 18:37 by jaypal singh

Couldn't get this to work, but thanks anyway - got no output when I tried even on the files above. – user1108364 Dec 20 '11 at 18:04
Worked out why it didn't work for me, it's because my diff command adds + and -, not < and > for differences - many thanks :) – user1108364 Dec 20 '11 at 18:12

jaypal singh · Answer 2 · 2011-12-20T18:14:29.223

This should work -

awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file1 file2

Explanation:

NR and FNR are awk's built-in variables. NR holds the number of records read so far across all input files and is not reset when awk moves from one file to the next. FNR is similar, but it counts records within the current file only, so it starts over at 1 for each new file.

In this awk one-liner, the condition NR==FNR restricts the action {a[$0]++;next} to file1, since NR==FNR is only true while file1 is being read. This action stores each line as a key in the array a, and next skips the second rule for those lines. Once awk starts reading file2, NR==FNR is no longer true, so the first action is never run again. awk then moves on to the second rule, which checks each line of file2 against the array (i.e. the lines of file1). If the line is in the array, it is ignored. If it is not in the array, it is printed, because such lines are the extra ones that appear only in file2.
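The same logic can be spelled out as a multi-line awk program with one comment per step. This is only a sketch: the wrapper name only_in_second, the array name seen and the sample files are made up for illustration, and the behaviour is meant to match the one-liner above.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner, spelled out with comments.
only_in_second() {
    awk '
        # First file: NR (records read overall) equals FNR (records read in this file).
        NR == FNR {
            seen[$0] = 1   # remember the whole line as an array key
            next           # skip the rule below for lines of the first file
        }
        # Second file: print every line that was never seen in the first file.
        !($0 in seen)
    ' "$1" "$2"
}

# Made-up sample input in a scratch directory.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' 'The quick brown' 'lazy dog' > "$tmpdir/f1"
printf '%s\n' 'The quick brown' 'sadhasdgh' 'lazy dog' 'djakdjhgmv' 'lazy dog' > "$tmpdir/f2"

only_in_second "$tmpdir/f1" "$tmpdir/f2"
```

Because the lookup is by line content and not by position, an extra line in the second file that happens to be identical to any line of the first file is not printed. For the files in the question that is not a problem, since the extra lines are all distinct from the repeating paragraph.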

Test:

File1:

[jaypal:~/Temp] cat file1
The quick brown
fox jumped 
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog

File2:

[jaypal:~/Temp] cat file2
The quick brown
fox jumped
over the
lazy dog
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
The quick brown
fox jumped
over the
lazy dog
djakdjhgmv
asdjkljkgfyiyi
The quick brown
fox jumped
over the
lazy dog
jghytpuptou

Execution:

[jaypal:~/Temp] awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file1 file2
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
djakdjhgmv
asdjkljkgfyiyi
jghytpuptou

Wow, that is superb - just what I was after, thanks so much. Now I have to read and try and understand why it works!!! — user1108364, Dec 20 '11 at 18:05

potong · Answer 3 · 2011-12-20T21:50:35.933

This might work for you (GNU diff):

diff -bu file1 file2 | sed -n '1,2d;s/^+//p'
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
djakdjhgmv
asdjkljkgfyiyi
jghytpuptou
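For readers unfamiliar with the unified format, here is a commented sketch of what each part of the command above does. The function name added_lines and the sample files are made up for illustration; the flags are the ones used in the command itself.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that a unified diff marks as added.
added_lines() {
    # diff: -b ignores changes in the amount of whitespace, -u selects the unified format.
    # sed:  1,2d    drops the two header lines ("--- file1 ..." and "+++ file2 ...");
    #       s/^+//p strips the leading "+" from added lines and prints only those.
    diff -bu "$1" "$2" | sed -n '1,2d;s/^+//p'
}

# Made-up sample input; the first line of a.txt ends with a trailing space on purpose.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' 'fox jumped ' 'lazy dog' > "$tmpdir/a.txt"
printf '%s\n' 'fox jumped' 'lazy dog' 'jghytpuptou' > "$tmpdir/b.txt"

added_lines "$tmpdir/a.txt" "$tmpdir/b.txt"
```

With -b the trailing-space difference on the first line is not reported as a change, so only jghytpuptou should be printed. Context lines start with a space and hunk headers start with @@, so neither matches the ^+ pattern.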

answered Dec 20 '11 at 21:05 · edited Dec 20 '11 at 21:50

score 0 · Answer 4 · answered Dec 21 '11 at 21:50

diff -b sample.log sample.log.1 | awk '/^> / {print substr($0, 3)}'
