
I'm trying to run an AWK script in a loop. The script has two conditions and takes a variable from a predefined list. The goal is to extract each line in which column one and column three both partially match the same name. My input file looks like this:

pop1_io 1   pop1_ei 2   1   62027313    63797977    3.047
pop1_eg 1   pop2_yu 2   1   74240214    78974955    3.827
pop3_ab 1   pop1_zx 2   1   160604473   163511425   4.04

The first script I wrote works perfectly if I type the name manually, but it doesn't work when I loop over the names and insert the variable into the awk script. The working version:

awk '{if ($1 ~ /pop1/ && $3 ~ /pop1/)
    print $1"\t" $2 "\t" $3 "\t" $4"\t" $5 "\t" $6 "\t" $7 "\t" $8}' inputfile.ibd | sed -r '/^\s*$/d' > pop1.ibd

Not working ones:

pops="pop1 pop2 pop3"

for pop in $pops
do
awk '{if ($1 ~ /$pop/ && $3 ~ /$pop/)
    print $1"\t" $2 "\t" $3 "\t" $4"\t" $5 "\t" $6 "\t" $7 "\t" $8}' inputfile.ibd | sed -r '/^\s*$/d' > out.$pop.ibd
done

This first script doesn't print anything. My second attempt is this:

for pop in $pops
do
awk '{if (a[$1]=~$pop && a[$3]=~$pop)
    print $1"\t" $2 "\t" $3 "\t" $4"\t" $5 "\t" $6 "\t" $7 "\t" $8}' Roma_Czech.ibdne.ibd | sed -r '/^\s*$/d' > out.$pop.ibd
done

In this case it prints everything contained in the input file. How can I fix this script?

Gf.Ena
  • please update the question with the expected output of the `for` loop – markp-fuso Jun 21 '23 at 15:03
  • In very large part this is duplicative of [How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script); arguably all the other bugs should be separate questions, making the question overbroad insofar as it goes beyond a single issue. – Charles Duffy Jun 21 '23 at 15:28
  • BTW, it's better practice to store lists as arrays rather than strings. `pops=( pop1 pop2 pop3 )` and then `for pop in "${pops[@]}"` -- that way you can iterate over strings with spaces, strings that have wildcard characters, and other things where unquoted expansions don't do the right thing. – Charles Duffy Jun 21 '23 at 15:35
  • The sed section is used to remove spaces or tabs in an "empty" line. – Gf.Ena Jun 22 '23 at 12:14

3 Answers


A few issues with the current code:

  • to use shell (e.g., bash) variables in an awk script, pass them in with the `-v awk_var="$bash_var"` construct; inside a single-quoted awk script the shell never expands `$pop`
  • `=~` is not a valid operator in awk; the regex match operator is `~`
  • you can define the output field separator as a tab (`OFS="\t"`) so that you don't need to add an explicit `"\t"` between each output field
  • the references to `a[$1]` and `a[$3]` don't make sense in this case since the array `a[]` is never created, let alone populated
  • while the current definition of `pops` works in this case, you may want to consider using an array
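
As a minimal sketch of the first point (the temporary file and its three rows are hypothetical sample data modelled on the question's input, not OP's real file), the following compares the single-quoted `/$pop/` form with the `-v` form:

```shell
# Hypothetical sample data: three rows modelled on the question's input.
tmp=$(mktemp)
printf '%s\n' \
  'pop1_io 1 pop1_ei 2 1 62027313 63797977 3.047' \
  'pop1_eg 1 pop2_yu 2 1 74240214 78974955 3.827' \
  'pop3_ab 1 pop1_zx 2 1 160604473 163511425 4.04' > "$tmp"

pop=pop1

# Inside single quotes the shell does not expand $pop, so awk sees the literal
# regex /$pop/: an end-of-line anchor followed by "pop", which matches no row here.
literal_hits=$(awk '$1 ~ /$pop/ && $3 ~ /$pop/ {n++} END {print n+0}' "$tmp")

# With -v the shell value is copied into the awk variable "pop", and
# "$1 ~ pop" uses the variable's contents as a dynamic regex.
var_hits=$(awk -v pop="$pop" '$1 ~ pop && $3 ~ pop {n++} END {print n+0}' "$tmp")

echo "literal: $literal_hits, with -v: $var_hits"
rm -f "$tmp"
```

On this sample the literal form counts 0 matching rows and the `-v` form counts 1, which is consistent with OP's first attempt printing nothing.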

Making some changes to OP's current code:

pops=('pop1' 'pop2' 'pop3')

for pop in "${pops[@]}"
do
    awk -v pop="$pop" 'BEGIN {OFS="\t"} ($1~pop && $3~pop) {$1=$1; print}' inputfile.ibd > "out.$pop.ibd"
done

NOTES:

  • assumes the input file has 8 space-delimited fields
  • the `$1=$1` forces awk to rebuild the line so that `print` makes use of the new `OFS="\t"`
  • I'm not sure of the purpose of OP's `sed -r`; I'm leaving it out, but OP can add it back into the mix as needed
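
The effect of `$1=$1` can be seen in isolation with a small sketch. The input string and the `-` separator below are assumptions chosen only to make the difference visible; the answer itself uses a tab:

```shell
line='a b   c'

# Setting OFS alone does not touch $0, so print emits the record unchanged.
unchanged=$(printf '%s\n' "$line" | awk 'BEGIN {OFS="-"} {print}')

# Assigning to any field (here $1 to itself) makes awk rebuild $0,
# joining the fields with the current OFS.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN {OFS="-"} {$1=$1; print}')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The first line printed keeps the original spacing (`a b   c`), while the second is rejoined with the new separator (`a-b-c`).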

This generates:

pop1_io 1       pop1_ei 2       1       62027313        63797977        3.047

Assuming the only purpose of this for loop is to print out the matching rows from the input file then we can push the looping construct down into a single awk script, eg:

poplist='pop1:pop2:pop3'                     # build a list of ":" delimited strings

awk -v poplist="${poplist}" '
BEGIN { OFS="\t"
        n=split(poplist,pops,":")            # split the "poplist" variable on the ":" delimiter and place results in the pops[] array
      } 
      { for (i=1;i<=n;i++)                   # loop through indices of the pops[] array
            if ($1~pops[i] && $3~pops[i]) {
               $1=$1
               print > ("out." pops[i] ".ibd")
               next
            }
      }
' inputfile.ibd

This also generates:

pop1_io 1       pop1_ei 2       1       62027313        63797977        3.047
markp-fuso

I think this might be what you're trying to do, using any awk:

$ awk '
    {
        for (i=3; i>=1; i-=2) {
            key = $i
            sub(/_.*/,"",key)
            out = key ".ibd"
            if ( !seen[key]++ ) {
                printf "" > out
            }
        }
    }
    $3 ~ ("^" key) {
        print > out
    }
' file

$ head *.ibd
==> pop1.ibd <==
pop1_io 1   pop1_ei 2   1   62027313    63797977    3.047

==> pop2.ibd <==

==> pop3.ibd <==

Note that you don't need to provide a list like pop1 pop2 pop3; the script just creates an output file for each of those prefixes that exist in the input. If you hit a "too many open files" error, change it to the following, which will be a bit slower as it closes the output after every write:

$ awk '
    {
        for (i=3; i>=1; i-=2) {
            key = $i
            sub(/_.*/,"",key)
            out = key ".ibd"
            if ( !seen[key]++ ) {
                printf "" > out
            }
        }
    }
    $3 ~ ("^" key) {
        print >> out
        close(out)
    }
' file
Ed Morton
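awk -F'_|[ \t]+' -v list=pop1,pop2,pop3 '
    BEGIN{
        n=split(list,arr,",")                # split the comma-separated list into arr[]
        for(i=1; i<=n; i++) pops[arr[i]]     # referencing pops[...] creates the key for the "in" test
    }
    # with "_" and whitespace as separators, $1 and $4 hold the two population prefixes
    $1==$4 && $1 in pops { print $0 > ("out." $1 ".ibd")}
' file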
ufopilot
  • `print $0 > "out." $1 ".ibd"` will give you a syntax error in some awks, you need parens around any expression on the right side of input or output redirection for it to be portable to all awks - `print $0 > ("out." $1 ".ibd")`. You don't actually need `$0` in there of course - `print > ...`. You might also run into a "too many open files" error if you're generating many output files and not using GNU awk, e.g. see [error-awk-too-many-output-files-10-when-splitting-ssl-certificates](https://stackoverflow.com/questions/45285560/error-awk-too-many-output-files-10-when-splitting-ssl-certificates) – Ed Morton Jun 22 '23 at 14:18