1

I want to extract certain columns from a file based on the header name. Some of the names have spaces (the file is tab-delimited). I cannot remove or substitute the spaces since downstream applications will be affected. What I am looking for is to extract some column based on header names and output them in the order I desire

Here is an example of my file:

 Sample Note Intragenic Rate ABCDE_177447
 1032 NA 0.97867626 0.9300704670625763 0.72782564
 ABCDE_177447 NA 0.97836965 1.0 0.87218356 
 ABCDE_188399 NA 0.97859967 0.905527730405171 0.81188565
 ABCDE_189595 NA 0.9787659 0.9059075892313707 0.8089241
 ABCDE_189596 NA 0.9788054 0.9065243881070291 0.8092951

My desired output;

 Sample Intragenic ABCDE_177447
 1032 0.97867626 0.9300704670625763 0.72782564
 ABCDE_177447 0.97836965 0.87218356 
 ABCDE_188399 0.97859967 0.81188565
 ABCDE_189595 0.9787659 0.8089241
 ABCDE_189596 0.9788054 0.8092951

I have tried this solution AWK extract columns from file based on header selected from 2nd file

but it doesn't work with spaces on names and the same for this solution Extracting columns from a file

I also have tried this

$cat cols.awk

BEGIN {
 n=split(cols,col)
 for (i=1; i<=n; i++) s[col[i]]=i
}
NR==1 {
 for (f=1; f<=NF; f++)
   if ($f in s) c[s[$f]]=f
 next
}
{ sep=""
 for (f=1; f<=n; f++) {
   printf("%c%s",sep,$c[f])
   sep=FS
 }
 print ""
}

But when I ran my script like awk -F\t -f cols.awk.sh -v cols="Note,Sample,Intragenic Rate" metrics.txt I got the following error:

awk: illegal field $(), name "1"
 input record number 2, file metrics.txt
 source line number 12
Community
  • 1
  • 1
user2380782
  • 1,542
  • 4
  • 22
  • 60

2 Answers2

1

You can use this awk:

awk -v cols='Sample,Intragenic,ABCDE_177447' 'BEGIN{FS=OFS="\t"; nc=split(cols, a, ",")} NR==1{for (i=1; i<=NF; i++) hdr[$i]=i} {for (i=1; i<=nc; i++) if (a[i] in hdr) printf "%s%s", $hdr[a[i]], (i<nc?OFS:ORS)}' file

Sample  Intragenic  ABCDE_177447
1032    0.97867626  0.72782564
ABCDE_177447    0.97836965  0.87218356
ABCDE_188399    0.97859967  0.81188565
ABCDE_189595    0.9787659   0.8089241
ABCDE_189596    0.9788054   0.8092951

Using cols command line arguments you can pass a comma delimited list of columns to print.

Here is more readable awk:

awk -v cols='Sample,Intragenic,ABCDE_177447' 'BEGIN {
   FS=OFS="\t"
   nc=split(cols, a, ",")
}
NR==1 {
   for (i=1; i<=NF; i++)
      hdr[$i]=i
}
{
   for (i=1; i<=nc; i++)
      if (a[i] in hdr)
         printf "%s%s", $hdr[a[i]], (i<nc?OFS:ORS)
}' file
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • thanks @anubhava, could you tell me how to do it for including command line arguments like yourfile.sh "cols1,cols 2" inputfile? – user2380782 Apr 07 '16 at 09:31
  • Inside `yourfile.sh` you can save this awk command: `awk -v cols="$1" 'BEGIN{FS=OFS="\t"; nc=split(cols, a, ",")} NR==1{for (i=1; i<=NF; i++) hdr[$i]=i} {for (i=1; i<=nc; i++) if (a[i] in hdr) printf "%s%s", $hdr[a[i]], (i – anubhava Apr 07 '16 at 09:46
0
awk '{sub(/Note Intragenic Rate/,"Intragenic")}{sub(/NA/, "")}NR>2{sub($3, "")}1' file

Sample Intragenic ABCDE_177447
1032  0.97867626 0.9300704670625763 0.72782564
ABCDE_177447  0.97836965  0.87218356 
ABCDE_188399  0.97859967  0.81188565
ABCDE_189595  0.9787659  0.8089241
ABCDE_189596  0.9788054  0.8092951
Claes Wikner
  • 1,457
  • 1
  • 9
  • 8