3

new to this website here: i have a problem analyzing data in a csv file.

I've written a little script that reads input from a csv file and prints only the desired fields:

awk -F, -v _sourcefile=$i -v title="\"${k}\"" -v box="_${j}_" -v score="$dock_score_column" -v hbond="${xp_terms_columns[0]}" -v electro="${xp_terms_columns[1]}" -v phoben="${xp_terms_col umns[2]}" -v phobenhb="${xp_terms_columns[3]}" -v lowmw="${xp_terms_columns[4]}" -v rotpenal="${xp_terms_columns[5]}" -v lipophilicevdw="${xp_terms_columns[6]}" -v phobenpairhb="${xp_terms_columns[7]}" -v sitemap="${xp_terms_columns[8]}" -v penalties="${xp_terms_columns[9]}" -v pistack="${xp_terms_columns[10]}" -v hbpenal="${xp_terms_columns[11]}" -v expospenal="${xp_terms_columns[12]}" -v picat="${xp_terms_columns[13]}" -v clbr="${xp_terms_columns[14]}" -v zpotr="${xp_terms_columns[15]}"
'BEGIN{format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"} $title_column ~ title && $source_column ~ _sourcefile && $source_column ~ box
{printf format, $score,"= ", $hbond, $electro, $phoben, $phobenhb, $lowmw, $rotpenal, $lipophilicevdw, $phobenpairhb, $sitemap, $penalties, $pistack, $hbpena l, $expospenal, $picat, $clbr, $zpotr}' $file


It's a complete mess but for now it does what i need to.
Question is: how can i make it simpler, by feeding it the fields stored inside {xptermscolumns[@]}?

The file is normal csv file, and the first part of the awk script just looks for the right records to print, my only problem is with the 16 different variables i have to declare to print.

I've tried using arrays inside awk like:

awk -F, -v _sourcefile=$i -v title="\"${k}\"" -v box="_${j}_" -v terms="$xp_terms_columns" 'BEGIN{split(terms, array, " ")} $title_column ~ title && $source_column ~ _sourcefile && $sour ce_column ~ box { n=asorti(array, sorted); for (i=1;i<=n;i++) printf " " $sorted[i] }' $file

But without success because i couldn't make asorti print the fields in the correct order.


Here is the first script above written legibly to help with this question and as an example for the OP to follow in future:

awk -F,                                         \
    -v _sourcefile="$i"                         \
    -v title="\"${k}\""                         \
    -v box="_${j}_"                             \
    -v score="$dock_score_column"               \
    -v hbond="${xp_terms_columns[0]}"           \
    -v electro="${xp_terms_columns[1]}"         \
    -v phoben="${xp_terms_columns[2]}"          \
    -v phobenhb="${xp_terms_columns[3]}"        \
    -v lowmw="${xp_terms_columns[4]}"           \
    -v rotpenal="${xp_terms_columns[5]}"        \
    -v lipophilicevdw="${xp_terms_columns[6]}"  \
    -v phobenpairhb="${xp_terms_columns[7]}"    \
    -v sitemap="${xp_terms_columns[8]}"         \
    -v penalties="${xp_terms_columns[9]}"       \
    -v pistack="${xp_terms_columns[10]}"        \
    -v hbpenal="${xp_terms_columns[11]}"        \
    -v expospenal="${xp_terms_columns[12]}"     \
    -v picat="${xp_terms_columns[13]}"          \
    -v clbr="${xp_terms_columns[14]}"           \
    -v zpotr="${xp_terms_columns[15]}"          \
'
    BEGIN {
        format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
    }
    ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
        printf format, $score, "= ", $hbond, $electro, $phoben, $phobenhb, $lowmw,              \
                        $rotpenal, $lipophilicevdw, $phobenpairhb, $sitemap, $penalties,        \
                        $pistack, $hbpenal, $expospenal, $picat, $clbr, $zpotr
    }
' "$file"
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
Gioele
  • 43
  • 7
  • `awk` can easily read 2 (or more) files, so you can maybe put all the parameters and fields in a second (parameters) file more simply https://stackoverflow.com/a/21875931/2836621 – Mark Setchell Mar 14 '23 at 09:50
  • @MarkSetchell This seems like a good idea, never thought of that. Could you please expand a little on it? It's a little too advanced for me. – Gioele Mar 14 '23 at 10:14
  • There are lots of examples on SO, put `[awk] FNR` in the search box. Nice example here https://stackoverflow.com/a/42851648/2836621 – Mark Setchell Mar 14 '23 at 10:26
  • current code references columns `$title_column` and `$source_column` but I don't see any definitions/assignments for `title_column` and `source_column`; is this merely a typo (eg, `-v title_column=...` and `-v source_column=...` were mistakingly left out of this copy of the code)? – markp-fuso Mar 14 '23 at 15:44
  • @markp-fuso Yes they are other variables declared inside the script to find additional fields to look for, there's probably a cleaner way to do that but my problem is focused on the 16 different variables declaration. – Gioele Mar 15 '23 at 13:36

3 Answers3

7

Try this (untested), using any awk:

awk -F,                                         \
    -v _sourcefile="$i"                         \
    -v title="\"${k}\""                         \
    -v box="_${j}_"                             \
    -v score="$dock_score_column"               \
    -v xp_terms_columns="${xp_terms_columns[*]}" \
'
    BEGIN {
        split(xp_terms_columns,xp," ")
        hbond           = xp[1]
        electro         = xp[2]
        phoben          = xp[3]
        phobenhb        = xp[4]
        lowmw           = xp[5]
        rotpenal        = xp[6]
        lipophilicevdw  = xp[7]
        phobenpairhb    = xp[8]
        sitemap         = xp[9]
        penalties       = xp[10]
        pistack         = xp[11]
        hbpenal         = xp[12]
        expospenal      = xp[13]
        picat           = xp[14]
        clbr            = xp[15]
        zpotr           = xp[16]

        format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
    }
    ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
        printf format, $score, "= ", $hbond, $electro, $phoben, $phobenhb, $lowmw,              \
                        $rotpenal, $lipophilicevdw, $phobenpairhb, $sitemap, $penalties,        \
                        $pistack, $hbpenal, $expospenal, $picat, $clbr, $zpotr
    }
' "$file"

The above assumes you have some reason for printing 16 individual fields instead of just a range or all the fields or the fields after some point in the input or something else.

It also assumes the problem you're trying to solve is setting 16 awk variables with -v from the shell array as opposed to having 16 awk variables.

This might actually be all you need (again untested):

awk -F,                                         \
    -v _sourcefile="$i"                         \
    -v title="\"${k}\""                         \
    -v box="_${j}_"                             \
    -v score="$dock_score_column"               \
    -v xp_terms_columns="${xp_terms_columns[*]}" \
'
    BEGIN {
        nxp  = split(xp_terms_columns,xp," ")
        nfmt = split("%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s",fmt," ")
        if ( nxp != nfmt ) {
            print "field vs format count mismatch" | "cat>&2"
            exit 1
        }
    }
    ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
        printf "%-8s =", $score
        for ( i=1; i<=nxp; i++ ) {
            printf ("%s" fmt[i]), OFS, $(xp[i])
        }
        print ""
    }
' "$file"
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • I think your answer actually answered 100% my problem, since my problem is printing exactly those 16 fields exactly in that order, so if the csv changes for some reason the numbers printed will remain the exact ones in the exact order. Could you please explain why you split into array into xp and then into nxp? – Gioele Mar 15 '23 at 11:31
  • I'm passing a string to awk (the contents of the shell array `xp_terms_columns[]` as a blank-separated list of "words"), as it's impossible to pass a shell array to awk, so then I'm using `split()` to convert that string into an awk array `xp` so I can later loop on it's contents. `nxp` isn't another array, its the number of elements in `xp`. So `-v xp_terms_columns="${xp_terms_columns[*]}"` converts the shell array `xp_terms_columns[]` into a string and stores it in the awk variable `xp_terms_columns` and then `nxp = split(xp_terms_columns,xp," ")` converts that string into an awk array `xp` – Ed Morton Mar 15 '23 at 11:34
  • This solved my problem without a fuss, amazing! Just a quick question to follow tho: i'm using it modified according to my needs, if i want to modify your if check to check whether a particular element is equal to a number and not to a white space, how would you do it? `if ( 16 in xp == " " ) \ { printf "%-8s", $score \ exit 0 \ }` – Gioele Mar 15 '23 at 14:40
  • [Chameleon Questions](https://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions) are strongly discouraged on this forum, please ask a new question. – Ed Morton Mar 15 '23 at 14:41
1

I want to explain why your attempt failed, namely

-v terms="$xp_terms_columns"

did not work as you though, observe that

arr=("Able" "Baker" "Charlie")
echo $arr

gives output

Able
Daweo
  • 31,313
  • 3
  • 12
  • 25
1

unless you REALLY need those field names for some reason, assuming that shell array contains a bunch of column numbers you want printed out, then a simple sub-process could make life so much easier :

xp_terms_columns=( $( jot 127 | rev | shuf | rev | head -n 16 ) )

echo "\n\t ${xp_terms_columns[*]} | ${#xp_terms_columns[*]}\n"

date | gawk -p- -be '

BEGIN {
        format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
}
($title_column  ~ title)       && 
($source_column ~ _sourcefile) && 
($source_column ~ box) {

    printf( format, $score, "= ", $'"$( 

      awk NF=NF OFS=', $' ORS= <<< "$xp_terms_columns[*]" 

                                           )"') }'
     99 66 25 62 72 16 12 108 69 117 8 22 98 19 61 93 | 16
Tue Mar 14 23:27:03 EDT 2023 =                                                                                                                                                                          
    # gawk profile, created Tue Mar 14 23:27:03 2023

    # BEGIN rule(s)

    BEGIN {
     1      format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
    }

    # Rule(s)

     1  ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) { # 1
     1      printf format, $score, "= ", $99, $66, $25, $62, $72, $16, $12, $108, $69, $117, $8, $22, $98, $19, $61, $93
    }

And a clean piece of code is dynamically generated from that shell array of column numbers without having to fuss about them

RARE Kpop Manifesto
  • 2,453
  • 3
  • 11
  • I don't need those field names, i need them in the exact order i want them to be printed so your answer really helps me out! Could you please explain what $'"$( awk NF=NF OFS=', $' ORS= <<< "$xp_terms_columns[*]" )"') }' does? – Gioele Mar 15 '23 at 11:27
  • @Gioele : `<<< ….some….stuff…. `, by and large is just a shorthand for `echo ….some….stuff….`. The `*` is telling it to split out the array by the 1st character of `IFS`, which should be a single space unless one has customized it, which means each array element, in sequential order, would be placed in their own "field" (i.e. column) within `awk`…... – RARE Kpop Manifesto Mar 17 '23 at 00:28
  • @Gioele : …...`ORS= ` means I don't want trailing new line on output. `OFS=….` is requesting the output be delimited by this new string instead of the default, which is a single space. Finally, `NF=NF` instructs `awk` to convert all the input delimiters (or `"seps"` in `awk`-lingo) to `OFS`, plus output it. This approach, by and large, is much faster, since no loops rneeded, required column#s are now hard-coded into the script code. I've once tested the same concept by splitting a `1 GB` file into `1 billion` fields, 1 byte each, & print out select columns (like `$892391`). worked like a charm – RARE Kpop Manifesto Mar 17 '23 at 00:36
  • 1
    @Gioele : oh one more point : the front part `$'"$(` - the first `$` belonged inside outer layer `awk` code since this approach is 1 short, which is what that's for. The `'` immediately afterwards is for ending first half of outer `awk` code, which was single quoted, and switch to double-quoting for the substitution. This ***IS NOT*** the so called `ANSI C`-style quoting like `$'\357'` . – RARE Kpop Manifesto Mar 17 '23 at 00:43