The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single tab; and it outputs the lines found in both files with two tabs.
It also relies on sed's w command to write to files.
Given file 1.sorted.txt containing:
1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5
and file 2.sorted.txt containing:
1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5
the basic output from comm 1.sorted.txt 2.sorted.txt is:
1.line-1
1.line-2
	1.line-3
1.line-4
1.line-6
	2.line-1
		2.line-2
	2.line-4
	2.line-6
		3.line-5
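Since tabs are hard to tell apart from spaces on a terminal, one way to check the prefixes is to pipe the output through sed -n l, which prints each tab as \t and marks the end of each line with $. The following is a minimal sketch; the function name comm_visible is a hypothetical helper, not part of comm or sed.

```shell
# comm_visible FILE1 FILE2   (hypothetical helper name)
# Print comm's output with every tab shown as \t and every line end shown as $.
comm_visible() {
    comm "$1" "$2" | sed -n l
}
```

Running comm_visible 1.sorted.txt 2.sorted.txt on the sample files should show 1.line-1$ for a line unique to the first file, \t1.line-3$ for a line unique to the second, and \t\t2.line-2$ for a line common to both.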
Given a file script.sed containing:
/^\t\t/ {
s///
w file.3
d
}
/^\t/ {
s///
w file.2
d
}
/^[^\t]/ {
w file.1
d
}
you can run the following command and get the desired output:
$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$
The script works by:
- matching lines that start with 2 tabs, deleting the tabs, writing the line to file.3, and deleting the line (so the rest of the script is ignored),
- matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is ignored),
- matching lines that do not start with a tab, writing the line to file.1, and deleting the line.
The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and this script would work the same. However, see script3.sed below for further justification for keeping the symmetry.
As written, the script requires GNU sed; BSD sed doesn't recognize the \t escapes. The file could instead be written with actual tabs in place of the \t notation, and then BSD sed is OK with the script.
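One way to get literal tabs into the file without typing them in an editor is to let the shell insert them. This is a sketch assuming a POSIX shell; the variable name tab is an arbitrary choice, and the here-document is left unquoted so that ${tab} is expanded while the file is written.

```shell
# Hold one literal tab in a variable; printf expands \t in its format string.
tab=$(printf '\t')

# Write script.sed with real tab characters where the original uses \t.
cat > script.sed <<EOF
/^${tab}${tab}/ {
s///
w file.3
d
}
/^${tab}/ {
s///
w file.2
d
}
/^[^${tab}]/ {
w file.1
d
}
EOF
```

The resulting file contains no backslash escapes, so it does not depend on the \t extension.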
It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:
$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/ { s///\n w file.3\n d\n }' \
> -e $'/^\t/ { s///\n w file.2\n d\n }' \
> -e $'/^[^\t]/ { w file.1\n d\n }'
$
which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy; it expects the file name, and only the file name, after it on the same line of the script, hence the use of \n after the file names in the script. There are spaces aplenty that could be eliminated, but the symmetry is clearer with the layout shown. Using -f script.sed is probably simpler; it is certainly a technique worth knowing, as it avoids problems when the sed script must operate on single quotes, double quotes and back-quotes, which make it difficult to write the script on the Bash command line.
Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.
script3.sed (with tabs replaced by up to 8 spaces); note that this time there is a substitute s/// needed in the third paragraph (the d is still optional, but may as well be included):
/^                X/ {
s///
w file.3
d
}
/^        X/ {
s///
w file.2
d
}
/^X/ {
s///
w file.1
d
}
And the command line:
$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$
For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
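As an illustration of the Awk route, here is a minimal sketch that does not use comm and does not need sorted input. The function name split_lines and the array names seen1 and seen2 are hypothetical; the pass variable is set by the assignments placed between the file operands, which Awk processes in order as it reaches them. Note one difference from comm: duplicated lines are not paired off one-for-one, so a line that occurs twice in the first file and once in the second is written to file.3 twice.

```shell
# split_lines FILE1 FILE2   (hypothetical helper name)
# Writes file.1 (lines only in FILE1), file.2 (lines only in FILE2) and
# file.3 (lines in both). Both files are held in memory as array keys.
split_lines() {
    # Create or empty the outputs first: awk only creates a file when it
    # prints to it, so a category with no lines would otherwise leave no file.
    : > file.1
    : > file.2
    : > file.3
    awk '
        # Pass 1 reads FILE2 and remembers every line.
        pass == 1 { seen2[$0] = 1; next }
        # Pass 2 reads FILE1 and classifies each line as common or unique.
        pass == 2 {
            seen1[$0] = 1
            if ($0 in seen2) { print > "file.3" } else { print > "file.1" }
            next
        }
        # Pass 3 reads FILE2 again and keeps the lines FILE1 did not contain.
        !($0 in seen1) { print > "file.2" }
    ' pass=1 "$2" pass=2 "$1" pass=3 "$2"
}
```

It would be called as split_lines 1.sorted.txt 2.sorted.txt; lines appear in each output file in the order they had in the input.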