I am trying to merge hundreds of sample files that contain species names and proportions into one long-format file using a bash script. How can I add some characters at the beginning of each line of awk output?

I have a sample ID saved in the variable $STEM. I use awk to get the species names and proportions from each file. The proportion is the first field of each line and the species name is the last (6th) field; the fields are tab-separated. I also want to add the sample ID ($STEM) to the beginning of each line in the output file. Here is my code:

for file in $input_dir/*_species_abundance.txt
do
        STEM=$(basename "$file" _species_abundance.txt )
        echo "processing sample $STEM"
        awk '{print "$STEM," $1,$6}' FS='\t' $file >> $input_dir/merged_species_abundance.txt

done

The "$STEM," part doesn't work as expected: the output contains the literal text "$STEM" instead of the sample ID.

Do you have any suggestions on how I can modify my code? Thank you in advance!

Here is some sample input:

  0.45  124078  0       S       148633                s__Faecalibacterium prausnitzii_D
  0.35  95476   0       S       145938                s__Faecalibacterium prausnitzii_C
  0.21  57002   0       S       158191                s__Faecalibacterium prausnitzii_I
  0.18  49503   0       S       224832                s__Faecalibacterium sp900539945
  0.07  18991   0       S       157095                s__Faecalibacterium prausnitzii_G
  0.04  12007   0       S       187396                s__Faecalibacterium prausnitzii_F
...
... 

The first number is the proportion, and the last field is the species name.

The sampleID is something like 1001, 1002, 1003, ...

My desired output would be (comma-separated):

1001,0.45,s__Faecalibacterium prausnitzii_D
1001,0.35,s__Faecalibacterium prausnitzii_C
1001,0.21,s__Faecalibacterium prausnitzii_I
...
1002,0.28,s__Faecalibacterium prausnitzii_D
1002,0.00,s__Faecalibacterium prausnitzii_C
1002,0.01,s__Faecalibacterium prausnitzii_I
...
1003,0.60,s__Faecalibacterium prausnitzii_D
1003,0.02,s__Faecalibacterium prausnitzii_C
1003,0.39,s__Faecalibacterium prausnitzii_I
...
...
vicky
  • `TLDR`, but from a quick glance at your `awk` code, it just needs a `-v` flag to assign the shell variable: `awk -v var="$STEM" '{print var, $1,$6} .... ` – Jetchisel Mar 12 '21 at 02:01
  • Does [this](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script/19075707#19075707) answer your question? BTW, I'd also recommend switching to lower- or mixed-case shell variables, to avoid accidental conflicts with the many all-caps names that have special meaning to the shell and/or some commands. – Gordon Davisson Mar 12 '21 at 02:16
  • @GordonDavisson Thank you! I was searching for -v option in awk command. The link is very helpful. And thank you for the reminder. My colleague used STEM as the variable name, so I just kept using it... I will definitely keep your tip in mind! – vicky Mar 12 '21 at 02:28
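
The `-v` suggestion from the comments can be sketched end to end as follows. This is only an illustration under stated assumptions: the temporary directory, the three demo rows, the output name `merged.csv`, and the awk variable name `id` are all made up for the demo, and the input is assumed to be strictly tab-separated with no leading whitespace.

```shell
#!/usr/bin/env bash
# Sketch: hand the sample ID to awk with -v instead of embedding it in the program text.
# Everything created here (temp dir, demo rows, merged.csv) is hypothetical demo data.
input_dir=$(mktemp -d)

# Two tiny tab-separated demo files; columns 2-5 of the 1002 row are invented.
printf '0.45\t124078\t0\tS\t148633\ts__Faecalibacterium prausnitzii_D\n' >  "$input_dir/1001_species_abundance.txt"
printf '0.35\t95476\t0\tS\t145938\ts__Faecalibacterium prausnitzii_C\n'  >> "$input_dir/1001_species_abundance.txt"
printf '0.28\t77000\t0\tS\t148633\ts__Faecalibacterium prausnitzii_D\n'  >  "$input_dir/1002_species_abundance.txt"

merged="$input_dir/merged.csv"
: > "$merged"   # start from an empty output file

for file in "$input_dir"/*_species_abundance.txt; do
    # lower-case variable name, as recommended in the comments
    stem=$(basename "$file" _species_abundance.txt)
    # -v id=...  : awk variable holding the sample ID
    # -F'\t'     : split input on tabs, so the two-word species name stays in $6
    # -v OFS=',' : the commas in print become literal commas in the output
    awk -v id="$stem" -F'\t' -v OFS=',' '{ print id, $1, $6 }' "$file" >> "$merged"
done

cat "$merged"
```

One design choice worth noting: the sketch writes to `merged.csv` rather than `merged_species_abundance.txt`, because a file with the latter name inside `$input_dir` would itself match the `*_species_abundance.txt` glob if the loop were run a second time.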

1 Answer

I think this is what you're looking for:

input_dir=mydir;
for file in $input_dir/*_species_abundance.txt;
do
    STEM=$(basename "$file" _species_abundance.txt );
    echo "processing sample $STEM";
    awk '{print '$STEM' "," $1 "," $6 " " $7}' "$file" >> $input_dir/merged_species_abundance.txt
done

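The key to printing the value of the shell variable $STEM is to close the single-quoted awk program just before it and reopen it just after, so that $STEM sits outside the single quotes. The shell then expands it, and awk receives the value as part of the program text. Note that without FS='\t', awk splits fields on any whitespace, so the two-word species name ends up in $6 and $7; that is why both are printed.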

This is the output generated:

processing sample 1001
processing sample 1002
processing sample 2001
processing sample 2002
$ cat mydir/merged_species_abundance.txt
1001,0.45,s__Faecalibacterium prausnitzii_D
1001,0.35,s__Faecalibacterium prausnitzii_C
1001,0.21,s__Faecalibacterium prausnitzii_I
1001,0.18,s__Faecalibacterium sp900539945
1001,0.07,s__Faecalibacterium prausnitzii_G
1001,0.04,s__Faecalibacterium prausnitzii_F
1002,0.45,s__Faecalibacterium prausnitzii_D
1002,0.35,s__Faecalibacterium prausnitzii_C
1002,0.21,s__Faecalibacterium prausnitzii_I
1002,0.18,s__Faecalibacterium sp900539945
1002,0.07,s__Faecalibacterium prausnitzii_G
1002,0.04,s__Faecalibacterium prausnitzii_F
Luis Guzman
  • This depends on the sample ID being numeric; passing it as an awk variable (as in @Jetchisel's comment) is much safer. – Gordon Davisson Mar 12 '21 at 02:15
  • Thank you! the single quote works! I will try to pass it as awk variable as well. May I ask why using awk variable is much safer? – vicky Mar 12 '21 at 02:16
  • @GordonDavisson, it doesn't depend on the sample ID being numeric. It can include text. I just tested it with a file called `he's__species_abundance.txt`, and it worked fine: `he's_,0.18,s__Faecalibacterium sp900539945`. Note that I included a `'` in the name to make it weird. I suppose that it may not handle some strange characters if they are in the first part of the name, but I think that would be unlikely. – Luis Guzman Mar 12 '21 at 02:36
  • @LuisGuzman It doesn't work with that filename on any of the versions of `awk` I've tested with (and I don't see how it could). I get errors like `awk: 1: unexpected character '''` and `awk: syntax error at source line 1`. To get it to work, I have to add double-quotes (within the single-quotes), like: `awk '{print "'$STEM'" ...`. But to protect the variable from shell processing, it should be in double-quotes (*outside* the single-quotes), like: `awk '{print "'"$STEM"'" ...`. And that'll *still* fail under some circumstances. – Gordon Davisson Mar 12 '21 at 02:48
  • @GordonDavisson, it does work with the awk on my ol7 linux vm, but I get your point. Unusual characters in the name may break it. Also, I understand the security issue related to code injection, which could be a concern in some environments. I'm not against the `-v` option. When writing scripts where the above is a concern, it is the way to go. That said, for my day-to-day use and the scripts that I write to make my life easier around the office, it is usually not a concern, and I don't bother. – Luis Guzman Mar 12 '21 at 04:56
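
The point about numeric IDs in this comment thread can be checked with a small experiment. The sketch below is an illustration only: the ID `sampleA` and the one-line input are hypothetical. When the ID is spliced into the program text without awk-level quotes, awk reads it as the name of an awk variable that was never set, which evaluates to an empty string; a numeric ID such as 1001 happens to work because it is a valid awk number literal.

```shell
#!/usr/bin/env bash
# Hypothetical non-numeric sample ID and a single made-up input line.
stem="sampleA"

# Splicing: the awk program text becomes {print sampleA "," $1}.
# sampleA is then an unset awk variable, so the ID silently disappears.
spliced=$(printf '0.45\tx\n' | awk -F'\t' '{print '$stem' "," $1}')

# Passing with -v: awk receives the ID as data, not as program text.
passed=$(printf '0.45\tx\n' | awk -F'\t' -v id="$stem" '{print id "," $1}')

echo "spliced: $spliced"   # prints "spliced: ,0.45"
echo "passed:  $passed"    # prints "passed:  sampleA,0.45"
```

An ID containing characters that are not valid in an awk identifier can instead produce a syntax error in the spliced version, which is the failure described in the comments.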