
I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.

The awk script I use for creating the array is as follows:

array=($(awk 'NR>1 { print $1 }' < file.tsv))

Then, to use this array in the other awk script

awk var="${array[@]}"  'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt

However, when I run this, the following error occurs.

awk: fatal: cannot open file 'foo' for reading (No such file or directory)

I've already looked at multiple posts on why this error occurs and on how to correctly pass a shell variable into awk, but none of them have worked so far. However, when I remove the shell variable, the script does work:

awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt

I really need that conditional statement, but I don't know what I'm doing wrong when passing the bash variable into awk and would appreciate some help.

Thanks in advance.

MK-1
  • You cannot pass an array to a child process. You can only pass individual strings. This is a design restriction of Linux (and most, if not all, other operating systems), where the only kind of value which can be "understood" by all processes is a string. To simulate passing an array, you would have to first serialize it into a string and then deserialize it on the receiving side. – user1934428 Oct 14 '22 at 08:56
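The serialize/deserialize approach this comment describes can be sketched as follows (a minimal illustration; the array contents here are made up):

```shell
# A bash array to pass into awk.
arr=("a b" "c d" "e f")

# Serialize: join the elements with newlines (printf emits one element
# per line; the command substitution strips the trailing newline).
joined=$(printf '%s\n' "${arr[@]}")

# Deserialize: hand the single string to awk with -v and split it back
# into an awk array on the newline separator.
awk -v s="$joined" 'BEGIN {
    n = split(s, parts, "\n")
    for (i = 1; i <= n; i++) print i ": " parts[i]
}'
```

Newline is a safe separator as long as no array element itself contains a newline.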

3 Answers


That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array; shell and awk are two completely different tools, each with its own syntax, semantics, scopes, etc.

Here's how to really do what you're trying to do:

array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )

awk -v xyz="${array[*]}" '
    BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
    ... now use `var` as you were trying to ...
'

For example:

$ cat file.tsv
col1    col2
a b     c d e
f g h   i j

$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j

$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h

$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )

$ awk -v xyz="${array[*]}" '
    BEGIN {
        split(xyz,tmp,RS)
        for (i in tmp) {
            var[tmp[i]]
        }
        for (idx in var) {
            print "<" idx ">"
        }
    }
'
<f g h>
<a b>
Ed Morton

It's easier and more efficient to process both files in a single awk:

Edit: fixed the issues raised in the comments, thanks @EdMorton

awk '
    FNR == NR {
        if ( FNR > 1 )
            var[$1]
        next
    }
    FNR == 1 {
        for (i = 1; i <= NF; i++)
            heading[i] = $i
        next
    }
    {
        for (i = 2; i <= NF; i++)
            if ( $i == "1" && heading[i] in var) {
                outFile = heading[i] ".txt"
                print ">kmer" (NR-1) "\n" $1 >> (outFile)
                close(outFile)
            }
    }
' file.tsv input.txt
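The two-file approach above can be sketched end to end with made-up data (the file names and the sampleA/sampleB/sampleC column names are hypothetical; real data would be tab-separated, but since the script uses awk's default FS, space-separated test files work too). One caveat worth noting: `NR` keeps counting across both files, so the kmer numbers are offset by the line count of file.tsv.

```shell
cd "$(mktemp -d)"

# First file: column 1 (minus the header) becomes the "wanted" set.
cat > file.tsv <<'EOF'
id desc
sampleA x
sampleB y
EOF

# Second file: header row, then one row per kmer with 0/1 flags.
cat > input.txt <<'EOF'
kmer sampleA sampleB sampleC
ACGT 1 0 1
TTAA 0 1 0
EOF

awk '
    FNR == NR {                 # first file: collect wanted headings
        if ( FNR > 1 )
            var[$1]
        next
    }
    FNR == 1 {                  # second file: remember the header row
        for (i = 1; i <= NF; i++)
            heading[i] = $i
        next
    }
    {
        for (i = 2; i <= NF; i++)
            if ( $i == "1" && heading[i] in var) {
                outFile = heading[i] ".txt"
                print ">kmer" (NR-1) "\n" $1 >> (outFile)
                close(outFile)
            }
    }
' file.tsv input.txt

# sampleA.txt and sampleB.txt are created; sampleC.txt is not,
# because sampleC never appears in file.tsv's first column.
cat sampleA.txt    # ">kmer4" then "ACGT" (NR includes file.tsv's 3 lines)
```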
Fravadona
  • You should do `close(outFile)` after the write rather than before it so you don't leave the loop having left the last output file open. I think that first block should probably be `FNR == NR { if ( FNR > 1 ) { var[$1] } next }` so the `next` isn't skipped for the first line of the first file, but it's hard to say without sample input/output. – Ed Morton Oct 14 '22 at 12:42
  • @EdMorton my bad, you're right; I didn't give it much thought when writing the first block. The rest is a copy-paste of OP's code; not sure it's even worth closing the files – Fravadona Oct 14 '22 at 14:02
  • Closing the files is worthwhile in case you hit the "too many open files" limit (e.g. https://stackoverflow.com/q/32878146/1745001 and https://stackoverflow.com/q/45285560/1745001). I'm not really following the logic of the script (haven't thought about it much) so I don't know if that can become an issue for this script. – Ed Morton Oct 14 '22 at 14:21
  • @EdMorton the number of output files is limited to `NF`, so it's probably not large enough to hit a limit – Fravadona Oct 14 '22 at 14:33
  • Depends if you're using the default awk on Solaris or not, as [its limit on concurrently open files is 10](https://stackoverflow.com/q/45285560/1745001). I don't usually consider that awk's limitations (it is old, broken awk after all), but since closing files robustly is such a simple thing, no harm in doing it IMHO. – Ed Morton Oct 14 '22 at 14:37
  • @Fravadona : to be fair to Ed, I just threw a `1.1 GB` mp4 into `mawk2` and told it to split fields byte-by-byte, and `NF` came out to be `NF = 1,140,275,285`, i.e. `1.14` ***billion*** – RARE Kpop Manifesto Oct 20 '22 at 19:07

You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example: let file1.txt content be

A B C
D E F
G H I

and file2.txt content be

1
3
2

then

var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt

gives output

First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D

Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved have delimiters compliant with default GNU AWK behavior, i.e. one or more whitespace characters is always a delimiter.

(tested in gawk 4.2.1)
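If that assumption doesn't hold, you can pass an explicit separator as the third argument to split. A minimal sketch of that edge case (the file contents here are made up): first-column values that themselves contain spaces survive if you split on newlines instead:

```shell
# Hypothetical tab-separated input whose first column contains spaces.
printf 'a b\tX\nc d\tY\n' > file1.txt

# Extract column 1 with an explicit tab FS; values stay newline-joined.
var1=$(awk -F'\t' '{print $1}' file1.txt)

# Split on "\n" explicitly so "a b" and "c d" remain single elements.
awk -v var1="$var1" '
    BEGIN {
        n = split(var1, arr, "\n")
        for (i = 1; i <= n; i++) print "<" arr[i] ">"
    }'
```

With the default (whitespace) split this would have produced four elements instead of two.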

Daweo