
Is there a good way to perform a 'full outer join' of multiple files on Unix, ideally using GNU coreutils? My files were produced by uniq -c; below is a mock example that can be generated by:

echo "12 aa\n3 bb" > file1
echo "5 aa\n6 bb\n1 cc" > file2
echo "11 aa\n7 bb\n4 dd" > file3

and looks like:

tail -n +1 file*
==> file1 <==
12 aa
3 bb

==> file2 <==
5 aa
6 bb
1 cc

==> file3 <==
11 aa
7 bb
4 dd

I want to merge them one by one, using column 2 as the key and filling in a 0 where a key is not present in a file (outer join). I.e. the desired output would look like this:

12 5 11 aa
3 6 7 bb
0 1 0 cc
0 0 4 dd

So far I found join to be doing the job at least for pairwise merging:

join -j2 -a 1 -a 2 -e '0' -o '1.1,2.1,0' file1 file2 > merged
# 12 5 aa
# 3 6 bb
# 0 1 cc

Note:
  • -j2: use field 2 of both files as the join key
  • -a FILENUM: also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

But I don't know how to generalize it for multiple files, i.e. this one doesn't work, which means I can't easily put it in a loop:

join -j2 -a 1 -a 2 -e '0' -o '1.1,2.1,0' merged file3 > merged2

Ideally I would not want to use SQL to achieve this, but it would be OK if there were no other way.

2 Answers


You can do what you want with grep, sed and awk, if you have the time to learn how to use them.

Otherwise, you can use this inefficient solution:

printf '12 aa\n3 bb\n' > file1
printf '5 aa\n6 bb\n1 cc\n' > file2
printf '11 aa\n7 bb\n4 dd\n' > file3
printf 'file1\nfile2\nfile3\n' > files

# collect every distinct key from all files
while read -r fil; do
    sed 's/^[0-9]* //' "$fil"
done < files | sort -u > lines

while read -r line; do
    while read -r fil; do
        # anchor the pattern so that e.g. key "a" does not match "aa"
        num=$(grep -E "^[0-9]+ $line\$" "$fil" | sed -E 's/^([0-9]+).*$/\1/')
        if [ "$num" = "" ]; then
            printf '0 '
        else
            printf '%s ' "$num"
        fi
    done < files
    printf '%s\n' "$line"
done < lines > result

cat result

The basic tool used in this script is the read builtin command of the shell. With it, you can parse a file line by line.

The principle is: for each "line" (aa, bb, cc, dd), you fetch the number of occurrences of this line in each file (file1, file2, file3), using grep. With this information, you can produce the output you are looking for.
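The same principle also fits into a single awk pass over all files (a sketch of my own, not part of the script above): remember which column each file's counts belong in, then print one row per key, with 0 for missing counts.

```shell
# Sketch: one pass over all files; nf tracks the current file's column.
awk '
  FNR == 1 { nf++ }                # entering the next input file
  {
    count[$2, nf] = $1             # count of this key in file number nf
    keys[$2] = 1                   # remember every key ever seen
  }
  END {
    for (k in keys) {
      line = ""
      for (i = 1; i <= nf; i++) {
        v = 0
        if ((k, i) in count) v = count[k, i]
        line = line v " "
      }
      print line k
    }
  }
' file1 file2 file3 | sort -k4,4   # "for (k in keys)" order is unspecified
```

The trailing sort -k4,4 assumes three input files (the key ends up in field 4); with a different number of files, sort on the last field instead.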

RalphS
  • This is certainly going in the right direction; although I had imagined a more elegant solution with `join`, this might do the job as well. However, `lines` here was derived manually. In order to fully answer the question, `lines` needs to be derived from the data, since it is not known beforehand. – Sebastian Müller Dec 07 '20 at 13:21
  • @SebastianMüller I changed the code to derive `lines` from the data. I do not know enough about `join` to provide a solution with it. If I had to spend time on this, I would learn `awk` instead of `join` (because you can do almost anything with `awk`). – RalphS Dec 13 '20 at 05:54

Finally I figured out an efficient solution using join recursively, packed into the following shell script multi_join_from_uniq.sh:

#!/bin/sh

join_rec() {
  file1=$1
  file2=$2
  shift 2
  if [ $# -gt 0 ]; then
    join -e0 -a 1 -a 2 "$file1" -o auto "$file2" | join_rec - "$@"
  else
    join -e0 -a 1 -a 2 "$file1" -o auto "$file2"
  fi
}

join_rec "$@"
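For reference, the first level of the recursion on its own already produces the pairwise outer join, given key-first, sorted input (process substitution here requires bash; with GNU join, -o auto pads missing fields with the -e value):

```shell
# First pairwise step only: full outer join of file1 and file2 by key.
join -e0 -a 1 -a 2 -o auto \
    <(awk '{print $2,$1}' file1 | sort) \
    <(awk '{print $2,$1}' file2 | sort)
# aa 12 5
# bb 3 6
# cc 0 1
```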

However, the script only works on input that is sorted and has the columns swapped (key first), which can be arranged with process substitution to make this example work:

sh multi_join_from_uniq.sh <(awk '{print $2,$1}' file1 | sort) <(awk '{print $2,$1}' file2 | sort) <(awk '{print $2,$1}' file3 | sort)

resulting in the following output:

aa 12 5 11
bb 3 6 7
cc 0 1 0
dd 0 0 4

Of course, this can also be done when piping the output of the uniq command, e.g. this way: sort data.txt | uniq -c | awk '{print $2,$1}' | sort > file1, or so.