How to get the output from the comm command into 3 separate files?

Question

The question Unix command to find lines common in two files has an answer suggesting the use of the comm command to do the task:

comm -12 1.sorted.txt 2.sorted.txt

This shows the lines common to the two files (the -1 suppresses the lines that are only in the first file, and the -2 suppresses the lines only in the second file, leaving just the lines common to both files as output). As the file names suggest, the input files must be in sorted order.

In a comment to that question, bapors asks:

How would one have the outputs in different files?

Seeking clarification, I asked:

If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use sed to split the output to three files.

User bapors confirmed:

It is exactly what I was asking. Would you show an example?

The answer is relatively long-winded and would spoil the simplicity of the answer to the other question (drowning it out with lots of information), so I've asked the question separately here — and provided an answer too.

Jonathan Leffler · Accepted Answer · 2017-09-21T06:26:58.220

The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single tab; and it outputs the lines found in both files with two tabs.

It also relies on sed's w command to write to files.

Given file 1.sorted.txt containing:

1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5

and file 2.sorted.txt containing:

1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5

the basic output from comm 1.sorted.txt 2.sorted.txt is:

1.line-1
1.line-2
        1.line-3
1.line-4
1.line-6
        2.line-1
                2.line-2
        2.line-4
        2.line-6
                3.line-5
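
In a listing like this the tabs are hard to tell apart from spaces. One way to check the prefixes is sed's l command, which prints each tab as \t and marks the end of each line with $. The sketch below is self-contained: it recreates the two sample files in a scratch directory (the mktemp directory and the variable name visible are assumptions made only for illustration).

```shell
#!/bin/sh
# Sketch: make the tab prefixes in comm's output visible.
export LC_ALL=C                    # byte-wise collation, so comm agrees the files are sorted
workdir=$(mktemp -d) || exit 1     # scratch directory, for illustration only
cd "$workdir" || exit 1

# Recreate the two sample files from the answer.
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# sed -n l prints every line unambiguously: tabs appear as \t, line ends as $.
visible=$(comm 1.sorted.txt 2.sorted.txt | sed -n l)
printf '%s\n' "$visible"
```

Lines found only in the first file show no \t, lines found only in the second file show one, and lines found in both show two.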

Given a file script.sed containing:

/^\t\t/ {
    s///
    w file.3
    d
}
/^\t/ {
    s///
    w file.2
    d
}
/^[^\t]/ {
    w file.1
    d
}

you can run the following command and then inspect the three output files:

$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$

The script works by:

1. matching lines that start with two tabs, deleting the tabs, writing the line to file.3, and deleting the line (so the rest of the script is skipped for that line);
2. matching lines that start with one tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is skipped for that line);
3. matching lines that do not start with a tab, writing the line to file.1, and deleting the line.

The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and the three files would have the same contents, although without the d those lines would also be echoed to standard output. However, see script3.sed below for further justification for keeping the symmetry.

As written, that requires GNU sed; BSD sed doesn't recognize the \t escapes. Obviously, the file could be written with actual tabs in place of the \t notation, and then BSD sed is OK with the script.
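
Another way to avoid the \t escape is to let the shell supply the tab. The following is a minimal sketch, assuming a POSIX shell and sed; the variable name tab and the scratch directory are illustrative choices, not part of the original script. Each -e fragment is treated as a separate script line, so the w commands still end at their file names.

```shell
#!/bin/sh
# Sketch: pass a literal tab to sed via a shell variable, so no \t escape is needed.
export LC_ALL=C
workdir=$(mktemp -d) || exit 1     # scratch directory, for illustration only
cd "$workdir" || exit 1

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

tab=$(printf '\t')                 # a real tab character

# Same three paragraphs as script.sed, one -e per script line.
comm 1.sorted.txt 2.sorted.txt |
sed -e "/^${tab}${tab}/{" -e 's///' -e 'w file.3' -e 'd' -e '}' \
    -e "/^${tab}/{"       -e 's///' -e 'w file.2' -e 'd' -e '}' \
    -e "/^[^${tab}]/{"                -e 'w file.1' -e 'd' -e '}'
```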

It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:

$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/  { s///\n w file.3\n d\n }' \
>     -e $'/^\t/    { s///\n w file.2\n d\n }' \
>     -e $'/^[^\t]/ {        w file.1\n d\n }'
$

which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy; it expects the file name, and only the file name, after it on the same line of the script, hence the use of \n after the file names in the script. There are spaces aplenty that could be eliminated, but the symmetry is clearer with the layout shown. And using the -f script.sed file is probably simpler — it is certainly a technique worth knowing as it can avoid problems when the sed script must operate on single, double and back-quotes, which makes it difficult to write the script on the Bash command line.

Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.

script3.sed (with tabs replaced by up to 8 spaces); note that this time a substitute s/// is needed in the third paragraph (the d still makes no difference to the files, but it keeps the line off standard output, so it may as well be included):

/^              X/ {
    s///
    w file.3
    d
}
/^      X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}

And the command line:

$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$

For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
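
To try the prefix technique on data that really does start with tabs, something like the following sketch could be used. It swaps Bash's process substitution for ordinary temporary files so that it also runs in a plain POSIX shell; that swap, the sample data, and the file names first.txt, second.txt and *.prefixed are assumptions made for illustration. The script is generated with printf so that the tabs in it are real tabs.

```shell
#!/bin/sh
# Sketch: the X-prefix technique on lines with leading tabs, using temporary
# files in place of process substitution (a portability assumption).
export LC_ALL=C
workdir=$(mktemp -d) || exit 1     # scratch directory, for illustration only
cd "$workdir" || exit 1

# Hypothetical sample data, already sorted; printf turns \t into a real tab.
printf '\tindented-both\nonly-in-first\n'     > first.txt
printf '\tindented-both\n\tindented-second\n' > second.txt

# script3.sed with real tabs in the patterns.
{
    printf '/^\t\tX/ {\n    s///\n    w file.3\n    d\n}\n'
    printf '/^\tX/ {\n    s///\n    w file.2\n    d\n}\n'
    printf '/^X/ {\n    s///\n    w file.1\n    d\n}\n'
} > script3.sed

# Add the prefix, compare, then let sed strip comm's tabs and the prefix.
sed 's/^/X/' first.txt  > first.prefixed
sed 's/^/X/' second.txt > second.prefixed
comm first.prefixed second.prefixed | sed -f script3.sed
```

With this data, the lines written to file.2 and file.3 keep their own leading tab, because only comm's tabs and the X are removed.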

You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
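
As a rough sketch of the Awk route (not the author's code), the following reads the first file, classifies the second file against it, and then rereads the first file to find the lines that appear only there. The variable pass, the array names and the input file names are hypothetical. Unlike comm, it treats a line as common if it occurs anywhere in both files, so repeated lines are handled differently, and an output file is only created if at least one line belongs in it.

```shell
#!/bin/sh
# Sketch: split two unsorted files into only-in-first / only-in-second / common
# with awk alone. Both files' lines are held in memory.
workdir=$(mktemp -d) || exit 1     # scratch directory, for illustration only
cd "$workdir" || exit 1

# Hypothetical unsorted sample data.
printf '%s\n' pear apple fig > 1.unsorted.txt
printf '%s\n' fig kiwi pear  > 2.unsorted.txt

# pass=N assignments between file names tell awk which read it is on.
awk '
    pass == 1 { in1[$0] = 1; next }            # remember every line of file 1
    pass == 2 {                                # classify each line of file 2
        in2[$0] = 1
        if ($0 in in1) { print > "file.3" } else { print > "file.2" }
        next
    }
    pass == 3 && !($0 in in2) { print > "file.1" }   # file 1 again: lines not in file 2
' pass=1 1.unsorted.txt pass=2 2.unsorted.txt pass=3 1.unsorted.txt
```

The output keeps the order of the input files rather than sorted order.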

Awesome contribution. Wish we still had SO documentation. One thing to add for users of BSD sed. If you happen to be in FreeBSD, your `/bin/sh` while an Almquist shell that is *not* bash, includes C-style quoting similar to bash. — ghoti, Sep 21 '17 at 06:39
this does not seem to work if I add multiple tabs/spaces at the start(and between words) to the lines — RomanPerekhrest, Sep 21 '17 at 09:36
@RomanPerekhrest: Are you saying that the variant using `script3.sed` and the process substitution doesn't work when the data has leading tabs or multiple blanks or tabs within the line? If so, I'd like to see the sample data. Would you please send it to me via email — see my profile. One possible issue is that `sort` and `comm` don't see eye to eye about what it means for data to be in sorted order. You might need to set `LANG=C` in the environment to get that to work. — Jonathan Leffler, Sep 21 '17 at 11:33
@JonathanLeffler, see this screenshot https://ibb.co/cCScW5. In the top window - your sed script content, in the bottom window - locale, 2 files contents and final command result — RomanPerekhrest, Sep 21 '17 at 13:44
Note that the version of `comm` you're using correctly reports that both `file 1` and `file 2` are not in sorted order (error message output). The inputs to `comm` must be sorted. If the inputs are not sorted, then you get what you get. The algorithm used by `comm` is a variation on merging; it shuffles through each file, working out whether the lines are equal, or which comes first, and generates appropriate output. If the input is not sorted, the output is unreliable. I'm assuming that you replaced the relevant spaces in the script with tabs, too. — Jonathan Leffler, Sep 21 '17 at 16:12

score 0 · Answer 2 · answered Sep 21 '17 at 10:30

comm + awk solution:

Complicated sample files:

1.txt:

1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6
2.line-2
        3.line-5 (tabs)

2.txt:

1.line-3
  2.line-1 with spaces
2.line-2
2.line-4
    2.line-6 with tabs
        3.line-5 (tabs)

The job:

comm -12 1.txt 2.txt > file-common 
awk 'NR==FNR{ a[$0];next }!($0 in a){ print $0 > "file"ARGIND-1 }' file-common 1.txt 2.txt

comm -12 1.txt 2.txt > file-common - saves the lines common to both files in file-common
awk ... - prints the lines unique to 1.txt and 2.txt into file1 and file2 respectively (ARGIND is a GNU awk extension, so this needs gawk)

Viewing results:

head file*
==> file1 <==
1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6

==> file2 <==
1.line-3
  2.line-1 with spaces
2.line-4
    2.line-6 with tabs

==> file-common <==
2.line-2
        3.line-5 (tabs)
