Can AWK transpose this data faster than a bash for loop?

Question

Here is my sample data:

DATA='target1.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:49
target2.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:30
target3.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:49
target1.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target2.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target3.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target1.domain,LAST_VULN_CRED_SCAN,2022/12/07 04:59:06
target2.domain,LAST_VULN_CRED_SCAN,2022/12/07 04:59:06
target3.domain,LAST_VULN_CRED_SCAN,2022/12/07 03:03:52'

Here is my desired output:

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52

Here is my current for loop:

UNIQUETARGETS=$(echo "${DATA}" | cut -d , -f 1 | sort | uniq)

echo 'Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN'
for TARGET in $UNIQUETARGETS; do
    LAST_VULN_AGENT_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_AGENT_SCAN' | cut -d , -f 3)
    LAST_VULN_NONCRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_NONCRED_SCAN' | cut -d , -f 3)
    LAST_VULN_CRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_CRED_SCAN' | cut -d , -f 3)
    echo "${TARGET},${LAST_VULN_AGENT_SCAN},${LAST_VULN_NONCRED_SCAN},${LAST_VULN_CRED_SCAN}"
done

While this method works, I am confident that AWK can do this much faster. I've searched for several "transpose" awk snippets, but nothing I've found can do quite what I want. Any help is greatly appreciated!

Why is your input stored in a shell variable rather than a file or incoming pipe stream? That usually indicates some problem earlier in your script. — Ed Morton, Dec 07 '22 at 21:22
That's an _incredibly_ slow loop; it would be trivial to make a far faster one in 100% bash (but also probably straightforward to make awk faster than the alternative, efficient, loop). — Charles Duffy, Dec 07 '22 at 21:39
If you want to write _fast_ code, don't ever put a command substitution (`$(...anything...)`) inside a loop. Likewise, no pipes. You don't _need_ either of those constructs; your code doesn't do anything that you couldn't use native, built-in shell features for, avoiding `grep` altogether. — Charles Duffy, Dec 07 '22 at 21:40
Similarly, [DontReadLinesWithFor](https://mywiki.wooledge.org/DontReadLinesWithFor); `for TARGET in $anything` is inherently broken: when you want to iterate over things with `for`, those things should be in an array, or should be glob expansion results. — Charles Duffy, Dec 07 '22 at 21:42
(as another note, all-caps variable names are used for variables meaningful to the shell, whereas names with at least one lowercase name are reserved for application use and so are what you should use in your scripts; see https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html for the relevant conventions) — Charles Duffy, Dec 07 '22 at 21:44
Was just about to post this so... See also [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) — Ed Morton, Dec 07 '22 at 21:45
@EdMorton The input is a file, I just wrote it that way for testing — 27560, Dec 07 '22 at 22:11
@27560 don't do that as it raises questions and can lead to worse solutions. If your real input is a file then show it as a file in your question. — Ed Morton, Dec 07 '22 at 22:31

score 6 · Accepted Answer · answered Dec 07 '22 at 21:38

Using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    names[$1]
    times[$1,$2] = $3
}
END {
    hdr = "Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
    print hdr

    numCols = split(hdr,scans)

    for ( name in names ) {
        printf "%s%s", name, OFS
        for ( colNr=2; colNr<=numCols; colNr++ ) {
            scan = scans[colNr]
            time = times[name,scan]
            printf "%s%s", time, (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06

If you care about the order of output lines they can be sorted in whatever order you like.

dawg · Answer 2 · 2022-12-07T22:55:17.160

I would use ruby with its csv parser for this personally.

ruby -r csv -e '
options={ :headers=>false }
inp=CSV.parse($<.read, **options)
names=inp.map{|sa| sa[0]}.uniq.sort
header=["Name"]+inp.map{|sa| sa[1]}.uniq.sort
data=inp.each_with_object(Hash.new {|h,k| h[k]={} }) { 
        |(name, func, date), h|
        h[name][func]=date
}
table=CSV::Table.new([], headers: header)
names.each{ |n| line=[n]; header[1..].each { |h| line << data[n][h] }
    table << line }
puts table
' file

Prints:

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_CRED_SCAN,LAST_VULN_NONCRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 04:59:06,2022/12/07 00:08:43
target2.domain,2022/12/07 03:14:30,2022/12/07 04:59:06,2022/12/07 00:08:43
target3.domain,2022/12/07 03:14:49,2022/12/07 03:03:52,2022/12/07 00:08:43

score 0 · Answer 3 · answered Dec 07 '22 at 22:42

If you just wanted to do it in bash the logic still works with very few tweaks required.

declare -A names=() v=();  
while IFS=, read n c t; do names[$n]=1; v["$n,$c"]="$t"; done < test.in
echo "Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
for n in "${!names[@]}"
do echo "$n,${v[$n,LAST_VULN_AGENT_SCAN]},${v[$n,LAST_VULN_NONCRED_SCAN]},${v[$n,LAST_VULN_CRED_SCAN]}"
done

or

hdr="Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
echo "$hdr"; IFS=,  read -a col <<< "$hdr"
declare -A names=() v=();  
while IFS=, read n c t; do names[$n]=1; v["$n,$c"]="$t"; done < test.in
for n in "${!names[@]}"; do echo "$n,${v[$n,${col[1]}]},${v[$n,${col[2]}]},${v[$n,${col[3]}]}"; done

The output -

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06

And as mentioned, you could always add sort.
If that last line were

for n in "${!names[@]}"; do echo "$n,${v[$n,${col[1]}]},${v[$n,${col[2]}]},${v[$n,${col[3]}]}"; done | sort

then the output becomes

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52

But awk is still smarter.

jpseng · Answer 4 · 2022-12-09T11:08:59.023

Maybe you want to compare against jq. On my machine it runs four times faster than your bash solution

#!/bin/bash

#JQ='jq_1.5'
JQ='jq_1.6'
RUNS=100

JQ_CODE='
  split("\n") |               # split lines by "\n"
  map(select(length > 0) |    # remove empty lines
      split(",")) |           # split each line by ","
  group_by(.[0]) |            # group by target
  map([ .[0][0],                                              # first field: target
        map(select(.[1] == "LAST_VULN_AGENT_SCAN")[2])[],     # second field
        map(select(.[1] == "LAST_VULN_NONCRED_SCAN")[2])[],   # third field
        map(select(.[1] == "LAST_VULN_CRED_SCAN")[2])[] ] |   # forth field
      join(", ")) |                                           # join fields by ", "
  "Name, LAST_VULN_AGENT_SCAN, LAST_VULN_NONCRED_SCAN, LAST_VULN_CRED_SCAN",    # print header
  .[]                                                                           # print lines
'

echo "---- jq: execude code -----"

jq -Rrs "$JQ_CODE" <<< "$DATA"

echo -e "\n---- Compare speed -----"

echo -e "\n1) bash: speed"
time for ((i = 0 ; i < $RUNS ; i++)); do
  UNIQUETARGETS=$(echo "${DATA}" | cut -d , -f 1 | sort | uniq)

  echo 'Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN' > /dev/null
  for TARGET in $UNIQUETARGETS; do
      LAST_VULN_AGENT_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_AGENT_SCAN' | cut -d , -f 3)
      LAST_VULN_NONCRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_NONCRED_SCAN' | cut -d , -f 3)
      LAST_VULN_CRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_CRED_SCAN' | cut -d , -f 3)
      echo "${TARGET},${LAST_VULN_AGENT_SCAN},${LAST_VULN_NONCRED_SCAN},${LAST_VULN_CRED_SCAN}" > /dev/null
  done
done

echo -e "\n2) jq: speed"
time for ((i = 0 ; i < $RUNS ; i++)); do
  $JQ -Rrs "$JQ_CODE" <<< "$DATA" > /dev/null
done

echo -e "\n3) jq: overhead"
time for ((i = 0 ; i < $RUNS ; i++)); do
  $JQ -Rrs '.' <<< "$DATA" > /dev/null
done

Output

---- jq: execude code -----
Name, LAST_VULN_AGENT_SCAN, LAST_VULN_NONCRED_SCAN, LAST_VULN_CRED_SCAN
target1.domain, 2022/12/07 03:14:49, 2022/12/07 00:08:43, 2022/12/07 04:59:06
target2.domain, 2022/12/07 03:14:30, 2022/12/07 00:08:43, 2022/12/07 04:59:06
target3.domain, 2022/12/07 03:14:49, 2022/12/07 00:08:43, 2022/12/07 03:03:52

---- Compare speed -----

1) bash: speed

real    0m2,362s
user    0m4,104s
sys     0m1,113s

2) jq: speed

real    0m0,622s
user    0m0,549s
sys     0m0,080s

3) jq: overhead

real    0m0,403s
user    0m0,364s
sys     0m0,046s

Can AWK transpose this data faster than a bash for loop?

4 Answers4