
I have a group of CSV files. For each file I need to:

  • Extract specific cells.
  • Order them according to a predefined order, which is located in another file.
  • Append the result to a new file (concatenating all files' results into the same file).

Example of a file (values1.csv):

Item, avg, max
TT, 3, 5
DD, 3, 6
ZZ, 6, 8
UU, 3, 3
JJ, 1, 5

Example of the predefined order (order.csv); I need all of the avg values and just a few of the max values:

DD_avg
ZZ_avg
ZZ_max
TT_avg
TT_max
UU_avg
JJ_avg

Output:

  file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
  values1.csv, 3, 6, 8, 3, 5, 3, 1
  values2.csv, ...................
  values3.csv, ...................

Is this possible with AWK (or any other Linux command)? My AWK skills are pretty limited, and I don't know how to approach this case. I would appreciate some help and guidance here.


Edit: Real data

cat values1.csv

item,avg,max
System/CPU/User/percent,4.8,
System/Memory/Used/bytes,57300000000,
System/Filesystem/^data/Used/bytes,859000000,
System/Disk/disk/Reads/count/sec,37.8,730
System/Disk/disk/Writes/Utilization/percent,7.24,
System/Disk/disk/Reads/bytes/sec,849000,42100000
System/Disk/disk/Writes,0.0026,
System/Disk/disk/Writes/bytes/sec,520000,33500000
System/Disk/disk/Writes/count/sec,46.2,903
System/Disk/disk/Utilization/percent,22.4,
System/Disk/disk/Reads/Utilization/percent,15.2,

cat order.csv

System/CPU/User/percent_avg
System/Memory/Used/bytes_avg
System/Filesystem/^data/Used/bytes_avg
System/Disk/disk/Reads/count/sec_avg
System/Disk/disk/Writes/count/sec_avg
System/Disk/disk/Reads/count/sec_max
System/Disk/disk/Writes/count/sec_max
System/Disk/disk/Reads/bytes/sec_avg
System/Disk/disk/Writes/bytes/sec_avg
System/Disk/disk/Writes/Utilization/percent_avg
System/Disk/disk/Reads/Utilization/percent_avg
Omri
  • For robust CSV parsing and handling variations in columns per file, you might want to look at the Python implementation. – randomir Jul 27 '17 at 01:53

4 Answers


With GNU awk for ARGIND:

$ cat tst.awk
BEGIN { FS=", *"; OFS=", " }
NR==FNR {
    colNames[++numCols] = $0
    next
}
{
    val[ARGIND,$1"_avg"] = $2
    val[ARGIND,$1"_max"] = $3
}
END {
    printf "file_name"
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", OFS, colNames[colNr]
    }
    print ""
    for (fileNr=2; fileNr<=ARGIND; fileNr++) {
        printf "%s", ARGV[fileNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", OFS, val[fileNr,colNames[colNr]]
        }
        print ""
    }
}

$ gawk -f tst.awk order.csv values1.csv
file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
values1.csv, 3, 6, 8, 3, 5, 3, 1

With other awks, just add a FNR==1{++ARGIND} line right after the BEGIN line; see the sketch below. If memory is an issue you can use less of it with gawk's ENDFILE statement instead of END, and there are other options - let us know if that is a concern.
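For illustration, the start of that portable variant would look like this (a sketch; tst_portable.awk is a hypothetical name, and everything after these lines is identical to tst.awk above):

$ cat tst_portable.awk
BEGIN { FS=", *"; OFS=", " }
FNR==1 { ++ARGIND }   # in awks other than gawk, ARGIND is an ordinary variable, so bump it when each new input file starts
NR==FNR {
    colNames[++numCols] = $0
    next
}
# ... the remaining rules are identical to tst.awk above ...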

Ed Morton
  • Ed, thanks for your help. For some reason, I get the following output, and I couldn't figure out what the issue is: `,System/Disk/disk/Reads/Utilization/percent_avgg values1.csv, , , , , , , , , , ,`. I've only changed the delimiter to `,` (no space), and edited the delimiter parameter in the script. I've added my real data above. Do you know what could be the issue? – Omri Jul 26 '17 at 21:37
  • You probably have control-Ms in your input file (as produced by Windows tools). Run dos2unix or similar on it before running any UNIX tool on it. For next time - if your real input file doesn't contain spaces (or any other characters/strings), then don't add them to the example you post in your question, so you don't end up with a solution to a problem you don't have! – Ed Morton Jul 26 '17 at 22:37
  • Works great. Thank you so much! – Omri Jul 26 '17 at 23:03
  • It shouldn't be too hard to make this script handle more than 3 columns, in variable order, but, IMHO, parsing CSVs with `awk` is asking for trouble. :) I usually end up rewriting such scripts in Python. – randomir Jul 27 '17 at 02:16
  • Right, it'd be a trivial couple of extra lines. I've never met a CSV I couldn't parse easily with awk but YMMV I suppose. – Ed Morton Jul 27 '17 at 02:52
  • Just for fun, consider this 6 column CSV row: `val"ue, "two\nliner","it's ""ok""" ,"no,really",just "fi,ne"` (where `\n` is expanded). It's parsable with `awk`, but I can't think of any elegant way. – randomir Jul 28 '17 at 21:56
  • I only see 1 row of 4 columns in that CSV: 1) `val"ue, "two\nliner"`, 2) `"it's ""ok"""`, 3) `"no,really"`, 4) `just "fi,ne"`, while Excel sees 2 separate rows with everything before the newline on 1 row and everything after it on another. I actually don't think it's valid CSV in any of the normally accepted "standards". What do you think the 6 columns are? – Ed Morton Jul 29 '17 at 01:53
  • First, one correction: `val"ue ,"two\nliner","it's ""ok""" ,"no,really",just "fi,ne"` (I transposed the first comma and space). Now, there are many "standards", which means there's no definitive one. But I think [RFC 4180](https://tools.ietf.org/html/rfc4180) for `text/csv` is a good start (also check [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) on Wikipedia). It seems OpenOffice, Google Docs and Python's `csv` module respect it, and they all see these 6 columns: 1) `val"ue `, 2) `two\nliner`, 3) `it's "ok"`, 4) `no,really`, 5) `just "fi`, 6) `ne"`. – randomir Jul 30 '17 at 12:45
  • I understand the issues around the many CSV "standards" and I have to parse CSVs a LOT so I come across all sorts but whatever "standard" you choose that line is invalid. For example like that RFC says (list item 5) `If fields are not enclosed with double quotes, then double quotes may not appear inside the fields` and wikipedia `Fields with embedded commas or double-quote characters must be quoted` so `val"ue ` **cannot** be a field. I understand some tool may guess at what you might have meant and display it to the best of its ability but that doesn't mean it's valid CSV in any "standard". – Ed Morton Jul 30 '17 at 13:19
  • How about you create a line in Excel that you think would be problematic, save it as CSV and post that (since no-one could argue about the validity of that CSV and it is the de-facto standard for CSVs these days due to how common it is) and then we can talk about parsing it with awk? Actually - can you post it as a new question so we're not fighting with formatting in comments? If you don't like Excel for some reason then feel free to create the CSV with some other tool as they'll all have some "standard" they conform to for creating CSVs and so presumably won't create invalid CSV. – Ed Morton Jul 30 '17 at 13:35
  • You're right, `val"ue` is invalid, but it's properly parsed by most (maybe all?). That RFC also says: `Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" when processing CSV files.` But I don't see a worthy question here (probably too broad), so let's just leave it at this. :) – randomir Jul 30 '17 at 14:13
  • OK. The thing is, when input is invalid there is no "proper" way to parse it; it's all undefined behavior. Those tools guessed one thing, Excel guessed another, and it turned out those other tools' guesses were closer to what you meant than Excel's was, but it could've gone the other way. The "proper" response IMHO would probably be to just say "this is invalid" and choke on it, rather than trying to guess and you maybe not noticing they guessed "wrong"! Too bad, I was looking forward to tackling it, and the solution would've been useful for anyone else trying to parse CSVs with awk in the future. – Ed Morton Jul 30 '17 at 14:23
  • I agree. But we're seeing this kind of behaviour whenever there's a lack of standardization and/or everyone decides to make their own standard. Btw, why don't you [ask and answer](https://stackoverflow.com/help/self-answer) your own question on CSV awk parsing? – randomir Jul 30 '17 at 15:47
  • I just feel awkward doing that, especially if someone else proposes an answer - I feel obliged to accept it even if I feel like mine really is better. I'd rather have someone else judge the answers. – Ed Morton Jul 30 '17 at 15:54
  • @randomir ah what the heck, I did it anyway. See https://stackoverflow.com/q/45420535/1745001 – Ed Morton Jul 31 '17 at 16:15
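Two quick sketches of the points raised in this thread (assumptions: gawk 4.0 or later for FPAT, no embedded newlines inside fields, and quoted.csv is a made-up sample). First, stripping the control-Ms mentioned above; second, using gawk's FPAT, which defines what a field looks like rather than what separates fields, to split rows containing quoted commas:

$ dos2unix values1.csv        # or: sed -i 's/\r$//' values1.csv

$ cat quoted.csv
a,"b,c",d
$ gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }   # a field is unquoted text or a double-quoted string
        { for (i=1; i<=NF; i++) print i, "<" $i ">" }' quoted.csv
1 <a>
2 <"b,c">
3 <d>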
akshay@db-3325:/tmp$ cat order 
DD_avg
ZZ_avg
ZZ_max
TT_avg
TT_max
UU_avg
JJ_avg

akshay@db-3325:/tmp$ cat values
Item, avg, max
TT, 3, 5
DD, 3, 6
ZZ, 6, 8
UU, 3, 3
JJ, 1, 5

akshay@db-3325:/tmp$ cat values1 
Item, avg, max
TT, 1, 3
DD, 2, 4

akshay@db-3325:/tmp$ awk  'BEGIN{FS=OFS=","}FNR==NR{o[oh[FNR]=$1];next}function p(){s="";for(i=1; i in oh; i++){ if(!hp){ hr=(hr?hr OFS:"") oh[i] }  s = (s ? s OFS:"")o[oh[i]]; o[oh[i]]="" } if(!hp){print "filename",hr; hp=1} print pf,s}k && FNR==1{p()}{gsub(/ /,""); for(i=2; i<=NF; i++){if(FNR==1){ h[i]=$i }else{ k = $1"_"h[i]; if(k in o)o[k]=$i } } pf=FILENAME }END{p()}' order values values1 
filename,DD_avg,ZZ_avg,ZZ_max,TT_avg,TT_max,UU_avg,JJ_avg
values,3,6,8,3,5,3,1
values1,2,,,1,3,,

More readable version:

awk '
 BEGIN{
     FS=OFS=","
 }
 FNR==NR{
        o[oh[FNR]=$1];
        next
 }
 function p(){
        s="";
        for(i=1; i in oh; i++){ 
           if(!hp){hr=(hr?hr OFS:"") oh[i]}  
           s = (s ? s OFS:"")o[oh[i]]; 
           o[oh[i]]="" 
        } 
        if(!hp){ print "filename",hr; hp=1} 
        print pf,s
 }
 k && FNR==1{ p() }
 {
    gsub(/ /,""); 
    for(i=2; i<=NF; i++)
    {
       if(FNR==1){ 
          h[i]=$i 
       }
       else{ 
          k = $1"_"h[i]; 
          if(k in o)o[k]=$i 
       } 
    } 
       pf=FILENAME 
 }
 END{
   p()
 }
' order values values1 
Akshay Hegde

awk to the rescue!

awk -F_ -v OFS=', ' '
         NR==FNR {h[++c]=$1; t[c]=$2; next}
         FNR==1  {if(!data) {
                    printf "%s", "file_name";
                    for(i=1;i<=c;i++)  printf "%s", OFS h[i]"_"t[i];
                    print ""}
                  else pr()}

         FNR>1   {avg[$1]=$2; max[$1]=$3; data=1}

         END     {pr()}

         function pr() {
             printf "%s", FILENAME;
             for(i=1;i<=c;i++)  printf "%s", OFS (t[i]=="avg"?avg[h[i]]:max[h[i]])
             print ""}' order.csv FS=', *' values1.csv 

file_name, DD_avg, ZZ_avg, ZZ_max, TT_avg, TT_max, UU_avg, JJ_avg
values1.csv, 3, 6, 8, 3, 5, 3, 1

Add other file names after values1.csv to process more files.
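For example, to process several files in one run (a sketch; values2.csv and values3.csv stand in for your remaining files, and '...' elides the program shown above):

awk -F_ -v OFS=', ' '...' order.csv FS=', *' values1.csv values2.csv values3.csv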

karakfa

This looks like a job for Python, at least if you want to properly parse CSVs (with quoted fields, multiline fields, fields containing commas, etc.), gracefully handle missing columns, and support a variable number of columns per file, columns in a different order in each file, different subsets of columns per file, etc.

Here's a Python 2/3 script that reads the column selection and order from the file supplied as the first argument, and then the "values files" from the remaining arguments. Selected rows and columns (in order) are printed to standard output (so you can redirect them to a file). For greater robustness in handling weird field values (like multiline ones), you'll want to use a csv.writer instead; see the sketch after the usage example below.

#!/usr/bin/python
import sys
import csv
from collections import defaultdict

with open(sys.argv[1], 'r') as csvfile:
    # AA_avg, BB_max lines -> [['AA', 'avg'], ['BB', 'max']]
    order = list(csv.reader(csvfile, delimiter='_'))

# output header
print(','.join(["file_name"] + ["{}_{}".format(*o) for o in order]))

for filename in sys.argv[2:]:
    with open(filename, 'r') as csvfile:
        # read all values in a 2D associative map
        reader = csv.DictReader(csvfile, skipinitialspace=True)
        values = defaultdict(dict)
        for row in reader:
            item = row[reader.fieldnames[0]]
            for field in reader.fieldnames[1:]:
                values[item][field] = row[field]

    # select and print only the fields from the order list; use .get() so a
    # missing item still yields 'N/A' and columns stay aligned with the header
    line = [filename] + [values.get(item, {}).get(field, 'N/A') for item, field in order]
    print(','.join(line))

Usage:

$ python reorder.py order.csv values1.csv values2.csv
file_name,DD_avg,ZZ_avg,ZZ_max,TT_avg,TT_max,UU_avg,JJ_avg
values1.csv,3,6,8,3,5,3,1
values2.csv,6,8,6,5,3,3,5
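A minimal sketch of that csv.writer variant (my assumption: output still goes to standard output, and only the two print(','.join(...)) calls in the script change):

import csv
import sys

# a csv.writer quotes fields containing commas, double quotes or newlines,
# so the output stays valid CSV even for weird field values
writer = csv.writer(sys.stdout)

# instead of: print(','.join(["file_name"] + ...))
writer.writerow(["file_name"] + ["{}_{}".format(*o) for o in order])

# instead of: print(','.join(line)), inside the per-file loop
writer.writerow([filename] + [values.get(item, {}).get(field, 'N/A')
                              for item, field in order])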
randomir