How to outer-join two CSV files, using shell script?

Question

I have two CSV files, like the following:

file1.csv

label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"

file2.csv

label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"

Desired Output.csv

label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

Currently with the below awk command i am able to combine the matching one's which have entries for label in both the files like PQR and XYZ but unable to append the ones that are not having label values present in both the files:

awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv

Do you need to use `awk`? Can you use a CSV-aware command line tool, that you'd (probably) have to install yourself? — Zach Young, Feb 09 '22 at 16:50
why does your `awk` command use a comma (`,`) as the input field delimiter when the sample data you've provided appears to show the pipe (`|`) as the input field delimiter? — markp-fuso, Feb 09 '22 at 16:58
@markp-fuso, OP originally posted the data as MD table, but didn't have a space between the filename and the table MD, so saw bad formatting, so they went back and removed the "bookmark" pipes on the ends, and removed the header line... leaving us with what you see now. I'm guessing their data does use a comma as the delimiter. — Zach Young, Feb 09 '22 at 17:09
@ZachYoung I don't see any commas in the original post/edit so the question stands ... what's the actual field delimiter? comma? pipe? something else? for a file suffix of `csv` I'd *assume* comma, and that matches with the `awk -F,`, but the provided data says otherwise; a simple cut-n-paste of the output from `cat file1.csv` would suffice ... — markp-fuso, Feb 09 '22 at 17:17
@ZachYoung & markp-fuso the delimiter is ( , ) just for formatting i had added that pipes as correctly guessed by Zach ..there was some formatting issue when i posted so i edited and reposted it. — ak9092, Feb 09 '22 at 17:26
update the question with the output from running `cat filename`; we should be able to cut-n-paste from your question/post into our environment for testing; we should not have to cut-n-paste the current 'data' and then have to edit to replace all pipes with commas; provide an exact copy of your data (as is it looks like right now your delimiter is actually ``) — markp-fuso, Feb 09 '22 at 18:05
@ak9092 For handling correctly a quoted CSV with `awk` you might need to write your own parser. There are some awk libraries for that, but if you'll be processing this kind of CSV regularly then It's easier to add a CSV aware tool to your environment. — Fravadona, Feb 10 '22 at 12:07
@Fravadona, you are absolutely right, I habe added an example using a CSV parser for AWK: https://stackoverflow.com/a/71452768/18135892 — datatraveller1, Mar 12 '22 at 20:37

Fravadona · Answer 1 · 2022-02-12T14:00:47.517

I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH) and you're done with the installation.

mlr --csv \
    join -f file1.csv -j 'label' --ul --ur \
    then \
    unsparsify --fill-with 'NA' \
    then \
    sort -f 'label' \
    file2.csv

Command parts:

mlr --csv
means that you want to read CSV files and output a CSV format. As an other example, if you want to read CSV files and output a JSON format it would be mlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur ...... file2.csv
means to join file1.csv and file2.csv on the field label and emit the unmatching records of both files
then is Miller's way of chaining operations
unsparsify --fill-with 'NA'
means to create the fields that didn't exist in each file and fill them with NA. It's needed for the records that had a uniq label
then sort -f 'label'
means to sort the records on the field label

Regarding the updated question: mlr handles the CSV quoting on its own. The only difference with your new expected output is that it removes the superfluous quotes:

label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

Very happy that you are using it more and more. It's a great tool — aborruso, Feb 10 '22 at 09:02
Yes it is. I use it for quick tasks, just like `jq` and `yq`. — Fravadona, Feb 10 '22 at 12:45

datatraveller1 · Answer 2 · 2022-02-10T22:38:38.493

This solution prints exactly the wished result with any AWK. Please note that the sorting algorithm is taken from the mawk manual.

# SO71053039.awk

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","
  out = "Output.csv"

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", LABEL_KEY_SWAP[i])
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
    print result_string > out
  }
}

Call:

awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

Kaffe Myers · Answer 3 · 2022-02-09T20:13:02.997

2

awk -v OFS=, '{
        if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
    } 
    END{
        for(v in o1) { print v, o1[v], o2[v] }
    }' file{1,2}

## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4

I think this will do nicely.

edited Feb 09 '22 at 20:13

answered Feb 09 '22 at 20:05

Kaffe Myers

424
3
9

Zach Young · Answer 4 · 2022-02-09T18:59:19.507

Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).

It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.

The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:

gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv

Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC   | 2      |       |        |
+-------+--------+-------+--------+
| XYZ   | 3      | XYZ   | 4      |
+-------+--------+-------+--------+
| PQR   | 6      | PQR   | 6      |
+-------+--------+-------+--------+
|       |        | LMN   | 8      |
+-------+--------+-------+--------+
|       |        | EFG   | 1      |
+-------+--------+-------+--------+

The data-part is correct, but it'll take some work to get the columns correct, and to get the NA values in there.

Here's a complete pipeline:

gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv

echo 'Final'
gocsv view final.csv

which gets us the correct, final, file:

Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+

There's a lot going on in that pipeline, the high points are:

Merge the the two label fields

| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \

Pare-down to just the 3 columns you want

| gocsv select -c 'label','Part-A','Part-B' \

Add the NA values and sort by label

| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \

I've made a step-by-step explanation at this Gist.

This is good but currently i don't have the liberty to install/setup another dependent tool in my current environment so was hoping if this could be done with something like awk, join, etc. which i can easily incorporate into shell script , then that would have been great. Thanks though!! — ak9092, Feb 09 '22 at 19:27
@ak9092, thanks for reminding me of **join**. I added another solution, native-shell utils only. — Zach Young, Feb 09 '22 at 22:00

Zach Young · Answer 5 · 2022-02-10T17:33:05.647

You mentioned join in the comment on my other answer, and I'd forgotten about this utility:

#!/bin/sh
rm -f *sorted.csv

# Join two files, normally inner-join only, but
# -  `-a 1 -a 2`:    include "unpaired lines" from file 1 and file 2
# -  `-1 1 -2 1`:    the first column from each is the "join column"
# -  `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2

join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv 

# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv

# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv

# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv

And, here's sorted.csv

+--------+--------+--------+
| label  | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0    | NA     |
+--------+--------+--------+
| EFG    | NA     | 1.0    |
+--------+--------+--------+
| LMN Wv | NA     | 8      |
+--------+--------+--------+
| PQR SN | 6      | 6      |
+--------+--------+--------+
| XYZ    | 3.0    | 4.0    |
+--------+--------+--------+

Thanks again for this solution with join but just found out some of my data in label has a string value, so in that case this is not working and creating two entries for same label. I have updated the question sample data — ak9092, Feb 10 '22 at 10:04
I just ran that script with your new inputs and get the expected output, see my **sorted.csv**. Still, this is getting tricky and now you're definitely in the arena of needing a "real CSV parsing solution". What's the limitation with installing good software to do the job correctly? Is this a personal or professional thing? — Zach Young, Feb 10 '22 at 17:34

Dudi Boy · Accepted Answer · 2022-02-10T12:38:47.820

1

We suggest gawk script which is standard Linux awk:

script.awk

NR == FNR {
  valsStr = sprintf("%s,%s", $2, "na");
  rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
  split(rowsArr[$1],valsArr);
  valsStr = sprintf("%s,%s", valsArr[1], $2);
  rowsArr[$1] = valsStr;
  next;
}
NR != FNR {
  valsStr = sprintf("%s,%s", "na", $2);
  rowsArr[$1] = valsStr;
}
END {
  printf("%s,%s\n", "label", rowsArr["label"]);
  for (rowName in rowsArr) {
     if (rowName == "label") continue;
     printf("%s,%s\n", rowName, rowsArr[rowName]);
  }
}

output:

awk -F, -f script.awk input.{1,2}.txt

label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4

edited Feb 10 '22 at 12:38

answered Feb 10 '22 at 06:40

Dudi Boy

4,551
1
15
30

this seems to be doing the trick but its not keeping the headers intact, currently headers are appearing anywhere in the data. Any idea how can i keep the headers intact on first line ? – ak9092 Feb 10 '22 at 09:47
fixed for your request. – Dudi Boy Feb 10 '22 at 12:37

datatraveller1 · Answer 7 · 2022-03-12T21:29:53.813

As @Fravadona stated correctly in his comment, for CSV files that can contain the delimiter, a newline or double quotes inside a field a proper CSV parser is needed.

Actually, only two functions are needed: One for unquoting CSV fields to normal AWK fields and one for quoting the AWK fields to write the data back to CSV fields.

I have written a variant of my previous answer (https://stackoverflow.com/a/71056926/18135892) that uses Ed Morton's CSV parser (https://stackoverflow.com/a/45420607/18135892 with the gsub variant which works with any AWK version) to give an example of proper CSV parsing:

This solution prints the wished result correctly sorted with any AWK. Please note that the sorting algorithm is taken from the mawk manual.

# SO71053039_2.awk

# unquote CSV:
# Ed Morton's CSV parser: https://stackoverflow.com/a/45420607/18135892
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
  if ((field ~ sep) || (field ~ /["\r\n]/)) {
    gsub(/"/, "\"\"", field)
    field = "\"" field "\""
  }
  return field
}

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "Part-A" OFS "Part-B"

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
    print result_string
  }
}

Call:

awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

How to outer-join two CSV files, using shell script?

7 Answers7

script.awk

output: