Merge CSV files using join/awk/sed

Question

Could you please help me to find THE bash command which will join/merge those following cvs files "template.csv + file1.csv + file2.csv + file3.csv + ... + fileX.csv" into "ouput.csv".

For each line in template.csv, concatenate associated values (if exist) listed in the fileX.csv as below:

template.csv:

header
1
2
3
4
5
6
7
8
9

file1.csv:

header,value1
2,value12
3,value13
7,value17
8,value18
9,value19

file2.csv:

header,value2
1,value21
2,value22
3,value23
4,value24

file3.csv:

header,value3
2,value32
4,value34
6,value36
7,value37
8,value38

output.csv:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

My template file is containing 35137 lines.
I already developed a bash script doing this merge (based on "do while", etc...) but the performance is not good at all. Too long to make the output.csv. I'm sure that it is possible to do the same using join, awk, ... but I don't see how ...

IMPORTANT UPDATE

The first column of my real files are containing a datetime and not a simple number ... so the script must take into account the space between the date and the time ... sorry for the update !

Script should be now designed with the below csv files as example:

template.csv:

header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00

file1.csv:

header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19

file2.csv:

header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24

file3.csv:

header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38

output.csv:

header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Ed Morton · Answer 1 · 2014-12-22T17:00:55.000

3

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR { key[++numRows] = $1 }
{ fld[$1,ARGIND] = $NF }
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", fld[key[rowNr],colNr], (colNr<ARGIND ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk template.csv file1.csv file2.csv file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

The above uses GNU awk for ARGIND, with other awks just add a line that says FNR==1 { ++ARGIND }.

edited Dec 22 '14 at 17:00

answered Dec 22 '14 at 16:26

Ed Morton

188,023
17
78
185

This solution is also working perfectly ... a bit more complex than the command proposed by user43791 – Adrien Dec 22 '14 at 16:41
You really think just populating and printing an array of fields indexed by column and row is more complex than a chain of piped joins with obscure options? It seems very straight forward to me but YMMV I suppose. – Ed Morton Dec 22 '14 at 16:50
@EdMorton To Adrien's defense, a pipeline of a documented and very strait-forward tool using three easily understandable options (via `man`) does sound simpler to me than a specifically crafted program in awk. That being said, I do like your answer since it is only using a single process, is pretty elegantly written and probably runs pretty fast too. Have an upvote! ;) – user43791 Dec 22 '14 at 17:06
Oh well, horses for course I suppose, maybe I'm just stuck in a paradigm because in 30+ years of UNIX programming I've never come across the `-o auto` option for join. Having said that, try modifying a chain of joins to print a warning when one of the files contains a date that's not in the template file or anything else unexpected - trivial to do when your stating point is an awk command but otherwise.... ;-). – Ed Morton Dec 22 '14 at 17:14

Wintermute · Answer 2 · 2014-12-22T16:34:17.407

1

This should work (for explanation read the comments):

#!/bin/sh

awk -F, -v file=0 '
  FNR == 1 {                     # first line in the file
    if(file == 0) {              # if in first file (template.csv):
      header = $1                # init header
    } else {
      header = header "," $2     # else append field name
    }
    next                         # forward to next line.
  }
  file == 0 {                    # if in first file:
    key[FNR] = $1                # remember key
    next                         # next line.
  }
  {
    field[$1][file] = $2         # otherwise: remember field
  }
  ENDFILE {                      # at the end of a file:
    file = file + 1              # increase counter
  }
  END {                          # in the end, assemble and
    print header                 # print lines.
    asort(key)
    for(k in key) {
      line = ""
      for(i = 1; i < file; ++i) {
        line = line "," field[key[k]][i]
      }
      print key[k] line
    }
  }
  ' template.csv file1.csv file2.csv file3.csv

edited Dec 22 '14 at 16:34

answered Dec 22 '14 at 11:05

Wintermute

42,983
5
77
80

Works fine (also with the last update) but the result is not sorted by date as in the template. – Adrien Dec 22 '14 at 13:15
@Adrien If this works for you, you could always sort the output by adding `|sort` at the end of the command line. – user43791 Dec 22 '14 at 16:24
I tried to sort the output but it is not so easy (i didn't found the good sort command) : header (first line) should not be sorted, date should be sorted in two times (1st sort : date + 2nd sort : time) ... i gave up ! – Adrien Dec 22 '14 at 16:31
It was already working before the last small changes (i didn't notice any change between the two version). Date&time are still not sorted as the template. – Adrien Dec 22 '14 at 17:19

score 1 · Accepted Answer · answered Dec 22 '14 at 16:05

You could use multiple calls to join :

join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv

Or more clearer :

alias myjoin='join -t , -a 1 -o auto'
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv

Explanation :

-t , specifies the field separator (,)
-a 1 instructs to print unpairable lines coming from the first file (an assumption is made that the header file contains all possible headers)
-o auto controls formatting and is necessary to print the empty fields

Proof :

$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Note :

For this to work, the files MUST be sorted on the join fields (the header in your case). You can use the sort command if this is not the case.

Just perfect! It is working like a charm and it is very fast :-) Thank you — Adrien, Dec 22 '14 at 16:27

Jan Legner · Answer 4 · 2014-12-22T11:31:18.797

0

I would go with this, however it surely is not the fastest running solution, but for your data it returns correct result and code is short:

#!/bin/bash
CONTENT=$(cat template.scv)
for line in $CONTENT; do
    TMP=$(echo $line)
    for file in file1.csv file2.csv file3.csv; do
        RESULT=$(grep "^$line," $file | cut -d',' -f2)
        TMP=$(echo $TMP,$RESULT)
    done
    echo $TMP
done

output:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

EDIT: my code was missing a comma (,), so for longer ids it did not work properly EDIT 2: Well it is not "not the fastest solution", it is really slow one

edited Dec 22 '14 at 11:31

answered Dec 22 '14 at 11:09

Jan Legner

644
6
13

Similar to the script that i've already developped. Too slow regarding the amonth of lines in the csv files. – Adrien Dec 22 '14 at 13:18
Agreed - luckily other answers provided useful solutions. – Jan Legner Dec 22 '14 at 22:35

Merge CSV files using join/awk/sed

4 Answers4

Linked