I would like to compare two files, say file1 and file2, and produce two new files, file1.out and file2.out. The lines the two files have in common (according to diff file1 file2) should come first in both outputs. The lines of file1 that are not in file2 should then be appended to file1.out, and the lines of file2 that are not in file1 appended to file2.out.

For example, let's say I have file1:

A
B
C
E

and file2:

A
C
D
E

I would like the common lines, A, C, and E, to come first in the modified files file1.out and file2.out, in their original order, and the distinct lines, B and D respectively, to be moved to the end. With my example, that would yield file1.out:

A
C
E
B

and file2.out:

A
C
E
D

More generally, my input files might have thousands of lines that are mostly the same, with some scattered differences that I would like pushed to the end for easier visual inspection.

I have looked at related questions, such as "Compare two files line by line and generate the difference in another file", but did not find the solution I am looking for. If you know how to generate the output described above, I would greatly appreciate it.

joanis
baban
  • Um, generate the `diff` to a temporary file, then `cat` that to append to the new file? Sadly you probably shouldn't try to append the output of `cat` directly to the file, as it might pick up the new content. You might instead diff to a variable and then echo the variable to append, if the diff is not too big. Bash will then do the temporary files for you. – Gem Taylor Jul 22 '19 at 18:35
  • Thanks, would it be possible for you to provide sample code (using my inputs) of what you said? I am really not an expert in bash. – baban Jul 22 '19 at 18:48

1 Answer

I think you can solve this problem by using diff -U <large number>, which produces output that is easy to parse into the files you want. If <large number> is at least the number of lines in the longer of your two files, the whole comparison comes out as a single hunk in a predictable format:

$ diff -U 1000 file1 file2
--- file1       2019-07-22 14:39:39.344674000 -0400
+++ file2       2019-07-22 14:39:45.072654000 -0400
@@ -1,4 +1,4 @@
 A
-B
 C
+D
 E
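
Since the question mentions inputs with thousands of lines, a fixed context of 1000 may be too small. A minimal sketch, assuming a POSIX shell, mktemp, and a diff that supports -U, is to derive the context size from the line counts instead. The scratch directory and the file name full.diff are illustrative choices, not part of the original answer.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Sample inputs from the question.
printf '%s\n' A B C E > file1
printf '%s\n' A C D E > file2

# Context size: more lines than both files combined, so nothing is cut off.
n=$(( $(wc -l < file1) + $(wc -l < file2) + 1 ))

# diff exits with status 1 when the files differ; that is not an error here.
diff -U "$n" file1 file2 > full.diff || [ $? -eq 1 ]

# Show the body without the three header lines (---, +++, @@).
sed '1,3d' full.diff
```

For the sample files, the body shows A, C and E prefixed with a space, B prefixed with - (file1 only) and D prefixed with + (file2 only).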

Then you can use grep and sed to reconstruct the two output files you want:

diff -U 1000 file1 file2 | sed '1,3d' > tmp
grep '^ ' tmp | sed 's/^ //' > file1.out
cp file1.out file2.out
grep '^-' tmp | sed 's/^-//' >> file1.out
grep '^+' tmp | sed 's/^+//' >> file2.out

Notes:

  • sed '1,3d' just deletes the first three lines of the diff output, since they are diff headers (---, +++, @@) rather than file contents. I previously had tail +3 here but that is not so portable; sed is safer.
  • The first grep extracts lines in common (start with a space in the diff).
  • The next two greps extract lines not in common (- means in file1 only, + in file2 only).
  • If file1 and file2 are identical, this will yield empty output files.
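
The steps above can also be folded into a single pass with awk, which avoids the tmp file. This is only a sketch under a few assumptions: split_common is a hypothetical helper name, the outputs are written next to the inputs with an .out suffix, and the context size is computed from the line counts rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: split_common FILE1 FILE2
# Writes FILE1.out and FILE2.out: common lines first, unique lines last.
split_common() {
    f1=$1
    f2=$2
    # Context larger than both files combined, so diff emits a single hunk.
    n=$(( $(wc -l < "$f1") + $(wc -l < "$f2") + 1 ))
    diff -U "$n" "$f1" "$f2" | awk -v out1="$f1.out" -v out2="$f2.out" '
        NR <= 3 { next }                      # skip the ---, +++ and @@ lines
        /^ /  { line = substr($0, 2)          # common line: goes to both files
                print line > out1
                print line > out2
                next }
        /^-/  { only1[++n1] = substr($0, 2); next }   # only in FILE1
        /^\+/ { only2[++n2] = substr($0, 2); next }   # only in FILE2
        END {
            # Append the unique lines after all common lines.
            for (i = 1; i <= n1; i++) print only1[i] > out1
            for (i = 1; i <= n2; i++) print only2[i] > out2
        }'
}
```

Calling split_common file1 file2 on the question's inputs should leave A, C, E, B in file1.out and A, C, E, D in file2.out. As with the grep version, identical inputs produce no diff output, so in this sketch the .out files are not written at all.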
joanis
  • Somehow, I am getting the error 'head: cannot open '+3' for reading: No such file or directory' when running 'diff -U 1000 file1 file2 | head +3'. However, I can see the output you showed above when I leave out 'head +3'. – baban Jul 22 '19 at 19:21
  • I guess it's a matter of versions of `tail`. I think GNU tail supports the `+3` syntax. The point is just to delete the first three lines of the diff, since they're not actual contents. – joanis Jul 22 '19 at 19:25
  • You can use `sed '1,3d'` to delete the first three lines of a file, instead of `tail +3`. I'll edit my answer. – joanis Jul 22 '19 at 19:29