
I have 3 CSV files:

Base file (values initialised with 0):

steve tignor ash michael jose sam joshua
0       0     0     0     0     0    0

File 1:

tignor michael  jose
888      9       -2

File 2:

ash joshua
77   66

Output I need:

      steve tignor ash michael jose sam joshua
File1   0    888    0    9      -2   0     0
File2   0     0     77   0       0   0     66

I tried sorting the columns of each file with awk and then merging with paste, but with 1000+ columns and 30 files it just did not work.

Code:

awk -F"," 'NR==1{
  split($0,a,FS);asort(a);
  for(i=1;i<=NF;i++)b[$i]=i
} {
  for(i=1;i<=NF;i++)printf("%s,",$(b[a[i]]));
  print x
}' File1 > 1.csv

awk -F"," 'NR==1{
  split($0,a,FS);asort(a);
  for(i=1;i<=NF;i++)b[$i]=i
} {
  for(i=1;i<=NF;i++)printf("%s,",$(b[a[i]]));
  print x
}' File2 > 2.csv

paste -d"\n" 1.csv 2.csv > merge.csv

Need some assistance here. Thanks in advance.

Rish

1 Answer


I assumed that you omitted the commas in the files. If your files are actually space-separated, you can just change the separator used in the split function (see the note after the script below).

awk '
ARGIND==1 && FNR==1{               # header line of base.csv
  split($0, base, ",")             # remember the full column order
  printf("file,%s\n",$0)           # emit the output header
}
ARGIND > 1 && FNR==1{              # header line of each data file
  split($0, names, ",")            # columns present in this file
  printf("%s", ARGV[ARGIND])       # start the row with the file name
}
ARGIND > 1 && FNR==2{              # value line of each data file
  split($0, values, ",")
  for(i in names)
    line[names[i]] = values[i]     # map column name -> value
  for(i in base){
    if(base[i] in line)
      printf(",%s", line[base[i]]) # value present in this file
    else
      printf(",0")                 # column missing: default to 0
  }
  delete line                      # reset for the next file
  print ""
}
' base.csv file1.csv file2.csv
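
As noted above, if the files are actually whitespace-separated, the only change needed is the separator passed to split; a minimal sketch of just that change (the rest of the script stays the same, and a single-space separator makes awk split on any run of whitespace):

  split($0, base, " ")     # instead of split($0, base, ",")
  split($0, names, " ")    # instead of split($0, names, ",")
  split($0, values, " ")   # instead of split($0, values, ",")

The output rows would still be comma-separated unless you also change the printf calls.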

Example:

file1.csv:

tignor,michael,jose
888,9,-2

file2.csv:

ash,joshua
77,66

and base.csv:

steve,tignor,ash,michael,jose,sam,joshua
0,0,0,0,0,0,0

the output is:

file,steve,tignor,ash,michael,jose,sam,joshua
file1.csv,0,888,0,9,-2,0,0
file2.csv,0,0,77,0,0,0,66

Basically, the script runs in two steps:

  • First we read the column names from base.csv and store them in an array.
  • Then, for each data file, we store the names appearing in its header and print one value for each column of the base CSV; if a file has no value for a given column, we just print 0 instead.

P.S. I made a POSIX-awk-compatible version of the script:

awk --posix '
NR==FNR && FNR==1{                 # first file (base.csv), header line
  split($0, base, ",")
  printf("file,%s\n",$0)
}
NR>FNR && FNR==1{                  # later files, header line
  split($0, names, ",")
  printf("%s", FILENAME)           # FILENAME replaces ARGV[ARGIND]
}
NR>FNR && FNR==2{                  # later files, value line
  split($0, values, ",")
  for(i in names)
    line[names[i]] = values[i]
  for(i in base){
    if(base[i] in line)
      printf(",%s", line[base[i]])
    else
      printf(",0")
  }
  delete line
  print ""
}
' base.csv file1.csv file2.csv
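
Since only the first file on the command line is treated as base.csv (the NR==FNR block), the same command scales to the 30 files mentioned in the question. A sketch, assuming the data files match a glob like file*.csv and the result should go to merged.csv (both names are placeholders):

awk --posix '
  # ... same program as above ...
' base.csv file*.csv > merged.csv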
lucaslugao
  • Hi Lucas, it worked like a charm. You have used the split function very smartly. If I have understood it well, FNR==2 is for selecting the last two files. Can you please explain a little about the usage of ARGIND and ARGV? – Rish Jun 25 '18 at 05:19
  • Hi! You can find the definition of all built-in variables [here](https://www.math.utah.edu/docs/info/gawk_11.html). Just to save you some time reading the docs: I used ARGIND (which stands for Argument Index) to select the file that I was processing; this variable is present in GNU awk only, so for regular awk take a look [here](https://stackoverflow.com/questions/24739570/how-to-handle-3-files-with-awk). The FNR variable represents the line (or record) number inside a file, as opposed to NR, which counts records globally (not restarting for each file). – lucaslugao Jun 25 '18 at 05:58
  • So to summarise the filters used: `ARGIND==1 && FNR==1` -> first file (base.csv), first line; `ARGIND > 1 && FNR==1` -> all files after the first, first line; `ARGIND > 1 && FNR==2` -> all files after the first, second line. To make it awk-compatible (not only gawk) you could change the filters as follows: `ARGIND==1 && FNR==1` -> `NR==FNR && FNR==1`, `ARGIND > 1 && FNR==1` -> `NR>FNR && FNR==1`, `ARGIND > 1 && FNR==2` -> `NR>FNR && FNR==2`. `ARGIND==1` is equivalent to `NR==FNR`, as explained [here](https://stackoverflow.com/a/32482115/4607321). – lucaslugao Jun 25 '18 at 06:04
  • I forgot to explain `ARGV`: it is just the array containing the filenames of the input files. The expression `ARGV[ARGIND]` then corresponds to the current filename and could be replaced by the variable `FILENAME`. I don't know why I used `ARGV[ARGIND]` instead (maybe because it is closer to some other programming languages). – lucaslugao Jun 25 '18 at 06:16
  • Hi Lucas, until now the logic was working perfectly and I was able to parse my files without issues. But now I am seeing extra columns getting added for each file, and I am not able to figure out where the logic breaks. Would you like to have a look, please? `$ cat 4G.csv | awk -F"," '{print NF}'` prints 1314 for the first line and then 1319 for every following line, so there are 5 extra columns introduced after applying the awk code. If you want I can share the exact files with you. – Rish Jun 13 '20 at 05:40
  • Hi Lucas, I figured out the issue: it is with the awk version that ships with RHEL 6 (`-bash-4.1$ awk --version` -> GNU Awk 3.1.7, Copyright (C) 1989, 1991-2009 Free Software Foundation). – Rish Jun 17 '20 at 01:25
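
For readers following the comment thread, a quick way to see the NR / FNR / FILENAME behaviour discussed above is to run a small one-liner over the two example data files:

awk '{ print FILENAME, NR, FNR }' file1.csv file2.csv

With the two-line example files from the answer this prints:

file1.csv 1 1
file1.csv 2 2
file2.csv 3 1
file2.csv 4 2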