Faster way to merge multiple files with unequal number of rows by column in bash

Question

I have a multiple files that I want to merge by column using shell script, let's say file a.txt and file b.txt. File a.txt contains sorted unique value and the first column will be used as a reference value.
example :

# cat a.txt 
001|johan
002|mike
003|adam

# cat b.txt
001|chu
001|stewart
002|lewis
002|jordan
003|lambert
003|johnson
003|smith
003|long

The both files will be combined to produce an output like below.

# cat c.txt
001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long

I tried simple iteration using while do.

while read line
do
  ids=`echo $line | awk -F"|" '{print $1}'`
  fn=`grep $ids a.txt`
  echo $fn"|"$line | awk -F"|" '{print $1"|"$2" "$4}'
done < b.txt > c.txt

But it takes a lot of time if I have million number of rows.
In MySQL we can easily achieve it using JOIN clause. But we need to load/insert them first.
Faster approach is probably using paste command, but as far as I know, the total rows of both files must be equal. Well I can adjust a.txt first. But still comsumes a lot of time when the script running.
Maybe someone has a better approach.

what do we do with a row that exists in `a.txt` but has no match in `b.txt`? what to do with a row in `b.txt` that does not have a match in `a.txt`? what is the max size of both files (MBytes, # of rows) (this will help determine if we can run the operation in memory)? please update the question with your answers; if the data is already sorted (in both files) you may may be able to use `join` — markp-fuso, Feb 20 '22 at 23:04

score 3 · Accepted Answer · answered Feb 20 '22 at 22:53

3

You could put all in one awk script:

awk -F'|' '{if (NR==FNR) a[$1]=$2; else print $1 "|" a[$1] " " $2}' a.txt b.txt

001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long

answered Feb 20 '22 at 22:53

pmf

24,478
2
22
31

I didn't know `awk` could do such a thing. Splendid. Thank you very much. – Dedy K Wijaya Feb 21 '22 at 00:22

markp-fuso · Answer 2 · 2022-02-20T23:29:21.267

Assumptions:

both files are sorted by the 1st column
ignore rows that do not have a match in the opposite file

One idea using join and sed (to remove the 2nd |):

join -t'|' -o 1.1,1.2,2.2 a.txt b.txt | sed -E 's/\|([^|]*$)/ \1/'

One awk idea:

awk '
BEGIN   { FS=OFS="|" }
FNR==NR { a[$1]=$2; next }
$1 in a { print $1,a[$1] " " $2 }
' a.txt b.txt

Both of these generate:

001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long

Faster way to merge multiple files with unequal number of rows by column in bash

2 Answers2