I would like to use each line of a file, samples.txt, as a regular expression and print every column of input.txt whose header matches it.

samples.txt

aa
bb
cc

input.txt

s   aa    v    dd    jj    bb    ww    cc
1   1     1    1     2     3     3     8
3   5     4    5     2     7     5     8  

output.txt

aa    bb    cc
1     3     8
5     7     8

I can do these operations separately: reading each line in bash to use as a regular expression, and using a regular expression to print the matching column. However, I cannot put them together. Any suggestions?

To print each matching column I can use:

awk 'NR==1 {for(i=1;i<=NF;i++) if ($i~/$line/) f=i;next} {print $f}' input.txt

And to iterate through the file for each line to use as a regular expression as above:

while read line; do echo $line; done < samples.txt

However, I can't combine the two. This attempt does not work:

while read line; do
    awk 'NR==1 {for(i=1;i<=NF;i++) if ($i~/$line/) f=i;next} {print $f}' input.txt >> output.txt; done < samples.txt
user3324491
  • the awk command you are showing does not print anything and has some errors like `$i ~ /$line/`, which is a wrong syntax because no `line` is defined anywhere. – fedorqui Jun 09 '15 at 13:31
  • a) you would NOT use a shell loop to call awk multiple times. b) why do you want to use a regexp instead of string comparison? c) If you DO want a regexp comparison then update your example to show how that would work (e.g. would `a` in samples.txt match `ab` and `ca` in input.txt or would you need `a.*` or `a?` in samples.txt?). – Ed Morton Jun 09 '15 at 13:41
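
As the first comment points out, awk never sees the shell variable: inside the single-quoted program the shell does not expand $line, and awk has no variable of that name. A minimal sketch of one way to hand a shell value to awk, using the standard -v option and a dynamic regexp. The awk variable name pat, the hard-coded value bb and the scratch directory are assumptions made for illustration, and unlike the command in the question this version also prints the header:

```shell
# Hypothetical scratch directory holding a copy of the question's sample data.
tmp=$(mktemp -d)
printf '%s\n' 's aa v dd jj bb ww cc' '1 1 1 1 2 3 3 8' '3 5 4 5 2 7 5 8' > "$tmp/input.txt"

line='bb'   # stands in for one line read from samples.txt

# -v copies the shell value into the awk variable "pat"; "$i ~ pat" then
# treats the contents of pat as a dynamic regexp.
col_bb=$(awk -v pat="$line" '
    NR==1 { for (i=1; i<=NF; i++) if ($i ~ pat) f=i }   # remember the matching column
    f     { print $f }                                  # print it for every line
' "$tmp/input.txt")

printf '%s\n' "$col_bb"
rm -rf "$tmp"
```

This still runs awk once per sample if wrapped in a loop, which is what the second comment advises against. Note also that awk processes backslash escapes in values passed with -v.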

3 Answers

I think it is easier to transpose the input.txt file, print those lines starting with the given words and then transpose back:

$ awk 'FNR==NR {a[$1]; next} $1 in a' samples.txt <(transpose < input.txt) | transpose
aa bb cc
1 3 8
5 7 8

This uses the awk idiom 'FNR==NR {do_things; next} other_things' file1 file2, which performs do_things while reading file1 and other_things while reading file2.

In this case, we load all the names from the samples file as keys of the array a[]. Then we go through the transposed input data and check whether the first field of each line is in the array. If so, the condition is true and awk prints the line, since printing is the default action.
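
To see the two-file idiom in isolation, here is a small sketch on already-transposed data. The file names keys.txt and rows.txt and their contents are made up for the illustration:

```shell
# Hypothetical scratch files: one key per line, and rows whose first field is a name.
tmp=$(mktemp -d)
printf '%s\n' aa bb cc > "$tmp/keys.txt"
printf '%s\n' 'aa 1 5' 'dd 1 5' 'cc 8 8' > "$tmp/rows.txt"

# While awk reads keys.txt, NR (overall record count) equals FNR (per-file
# record count), so only the first block runs and "next" skips the filter.
# For rows.txt NR is larger than FNR, so only the "$1 in a" filter applies.
kept=$(awk 'FNR==NR {a[$1]; next} $1 in a' "$tmp/keys.txt" "$tmp/rows.txt")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

One known caveat of this idiom: if the first file is empty, NR==FNR remains true while the second file is read, so the filter never runs.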

transpose is a function I used in another answer of mine:

transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
        }'
}
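
A quick sketch of what the function does to a small table, assuming whitespace-separated fields and the same number of fields on every line. The three-column data is a cut-down copy of the question's input, and the variable names transposed and roundtrip are only for illustration:

```shell
# Same logic as the transpose function above: store every cell, then print
# column i of the input as row i of the output.
transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
       END {for (i=1; i<=max; i++)
              for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
       }'
}

original=$(printf '%s\n' 's aa v' '1 1 1' '3 5 4')

transposed=$(printf '%s\n' "$original" | transpose)              # columns become rows
roundtrip=$(printf '%s\n' "$original" | transpose | transpose)   # back to the original layout

printf '%s\n' "$transposed"
```

After one pass each column name sits in the first field of its own line, which is why the '$1 in a' filter can select columns.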
fedorqui

In awk:

awk 'NR==FNR{a[$1]++;next}FNR==1{for(i=1;i<=NF;i++)b[i]=a[$i]}
            {for(i=1;i<=NF;i++)if(b[i])printf "%s\t",$i;print x}' {samples,input}.txt

aa      bb      cc
1       3       8
5       7       8

This collects the samples in the array a while reading the first file. Then, on the first line of the second file, it looks up each header field in a and stores the result in b[i], which is non-zero when that column's name is one of the samples.

It then loops over every line, printing only the fields whose entry in b is set.

To remove the trailing tab, following (Kent|Fedorqui|Ed Morton)'s advice:

awk 'NR==FNR{a[$1]++;next}FNR==1{for(i=1;i<=NF;i++)b[i]=(a[$i]==1&&(last=i))}
     {for(i=1;i<=NF;i++)if(b[i])printf "%s%s",$i,(i==last?ORS:OFS)}' {samples,input}.txt
123
  • good one +1. however your output has trailing space/tab. what you can do is, after `b[i]=a[$i]`, save `last=i`, then do `if(b[i])printf "%s%s",$i, last==i?RS:"\t"` – Kent Jun 09 '15 at 13:40
  • @Kent I think that's what you meant? Edit it if it's still wrong :) – 123 Jun 09 '15 at 13:48
  • Maybe even better to say `i==last?ORS:OFS`. This way, you use the default record separator / field separator. – fedorqui Jun 09 '15 at 13:52
  • That's not doing a regexp comparison. How about some parens in `b[i]=a[$i]==1&&last=i` as right now I for one have no idea what that list of characters does. Also `(i – Ed Morton Jun 09 '15 at 13:58
  • @EdMorton I didn't say it did a regex comparison anywhere. The list of characters sets `b[field]` to `a[whats in field]` and compares them to `1` to check if it is set, if it is then it sets `last` to that field. Thanks for the `OFS/ORS` for the output though I'll put that in. – 123 Jun 09 '15 at 14:05
  • @fedorqui Yep, will definitely be better, thanks for the idea :) – 123 Jun 09 '15 at 14:05
  • @User112638726 I know you didn't say it did, but it's what the OP asked for multiple times in his subject line and description. Unfortunately he didn't provide sample input/output that would produce different output for a regexp instead of string comparison and he's obviously not familiar with awk so he probably doesn't realize that this script isn't doing what he asked for since it is producing the expected output from his sample input. – Ed Morton Jun 09 '15 at 14:09
  • @EdMorton Oh, I took regular expression just to mean that he wanted to match a string, based on the context of the rest of the question. If they change the question to be clearer or confirm that they want regex, I'll edit the answer. – 123 Jun 09 '15 at 14:11

If you do want a regexp comparison then it's:

$ cat tst.awk
NR==FNR { colNames=(NR>1 ? colNames "|" : "") $0; next }
FNR==1 {
    numCols = 0
    for (i=1; i<=NF; i++) {
        if ( $i ~ "("colNames")" ) {
            colNrs[++numCols] = i
        }
    }
}
{
    for (i=1; i<=numCols; i++) {
        printf "%s%s", $(colNrs[i]), (i<numCols?OFS:ORS)
    }
}

$ awk -f tst.awk samples.txt input.txt
aa bb cc
1 3 8
5 7 8

If instead you actually want a string comparison then:

$ cat tst2.awk
NR==FNR { colNames[$0]; next }
FNR==1 {
    numCols = 0
    for (i=1; i<=NF; i++) {
        if ( $i in colNames ) {
            colNrs[++numCols] = i
        }
    }
}
{
    for (i=1; i<=numCols; i++) {
        printf "%s%s", $(colNrs[i]), (i<numCols?OFS:ORS)
    }
}

$ awk -f tst2.awk samples.txt input.txt
aa bb cc
1 3 8
5 7 8
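
Both scripts give the same output here because every sample is an exact column name. A small sketch of where they would differ, using a made-up header line and the made-up samples a and a. (none of this data is from the question):

```shell
header='s aa v ab ca'   # hypothetical header line

# Unanchored regexp match, as in tst.awk: "a" matches anywhere inside a field.
regex_hits=$(printf '%s\n' "$header" | awk -v re='a' '{
    out = ""
    for (i=1; i<=NF; i++) if ($i ~ re) out = out (out=="" ? "" : " ") $i
    print out }')

# Anchored regexp: the whole field has to match, here "a" plus any one character.
anchored_hits=$(printf '%s\n' "$header" | awk -v re='a.' '{
    out = ""
    for (i=1; i<=NF; i++) if ($i ~ "^(" re ")$") out = out (out=="" ? "" : " ") $i
    print out }')

# String comparison, as in tst2.awk: only a field equal to "a" would count.
string_hits=$(printf '%s\n' "$header" | awk -v key='a' '{
    out = ""
    for (i=1; i<=NF; i++) if ($i == key) out = out (out=="" ? "" : " ") $i
    print out }')

printf 'regexp:   %s\nanchored: %s\nstring:   %s\n' "$regex_hits" "$anchored_hits" "$string_hits"
```

With this header the unanchored regexp selects aa, ab and ca, the anchored one selects aa and ab, and the string comparison selects nothing.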

To run it on multiple input files, just list them all at the end of the awk command line. Do not write a shell loop to call awk multiple times.
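
A sketch of such a run with the string-comparison logic inlined, assuming two made-up input files (input1.txt and input2.txt) whose columns are in different orders. Because the column numbers are recomputed at FNR==1, each file is handled with its own header:

```shell
# Hypothetical scratch files for the illustration.
tmp=$(mktemp -d)
printf '%s\n' aa cc > "$tmp/samples.txt"
printf '%s\n' 's aa v cc' '1 2 3 4' > "$tmp/input1.txt"
printf '%s\n' 'cc s aa' '9 8 7' > "$tmp/input2.txt"

result=$(awk '
    NR==FNR { colNames[$0]; next }            # first file: remember the wanted names
    FNR==1  { numCols = 0                     # header of each input file: recompute columns
              for (i=1; i<=NF; i++) if ($i in colNames) colNrs[++numCols] = i }
    { for (i=1; i<=numCols; i++) printf "%s%s", $(colNrs[i]), (i<numCols?OFS:ORS) }
' "$tmp/samples.txt" "$tmp/input1.txt" "$tmp/input2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

A single awk process reads samples.txt once and then every input file in turn, so nothing is re-parsed per file.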

Ed Morton