I am trying to merge hundreds of sample files that contain species names and proportions into one long-format file using a bash script. How can I add some characters at the beginning of each line of awk output?
I have a sampleID saved in the variable $STEM. I used awk to get the species names and proportions from each file. The proportion is at the beginning of each line; the species name is at the end (6th field) of each line (tab-separated). But I also want to add the sampleID ($STEM) to the beginning of each line in the output file. Here is my code:
for file in $input_dir/*_species_abundance.txt
do
STEM=$(basename "$file" _species_abundance.txt )
echo "processing sample $STEM"
awk '{print "$STEM," $1,$6}' FS='\t' $file >> $input_dir/merged_species_abundance.txt
done
The "$STEM," part doesn't work as expected: the output contains the literal text $STEM instead of the sampleID.
Do you have any suggestions on how I can modify my code? Thank you in advance!
Here is some sample input:
0.45 124078 0 S 148633 s__Faecalibacterium prausnitzii_D
0.35 95476 0 S 145938 s__Faecalibacterium prausnitzii_C
0.21 57002 0 S 158191 s__Faecalibacterium prausnitzii_I
0.18 49503 0 S 224832 s__Faecalibacterium sp900539945
0.07 18991 0 S 157095 s__Faecalibacterium prausnitzii_G
0.04 12007 0 S 187396 s__Faecalibacterium prausnitzii_F
...
...
The first field is the proportion, and the last field is the species name.
The sampleID is something like 1001, 1002, 1003, ...
My desired output would be (comma-separated):
1001,0.45,s__Faecalibacterium prausnitzii_D
1001,0.35,s__Faecalibacterium prausnitzii_C
1001,0.21,s__Faecalibacterium prausnitzii_I
...
1002,0.28,s__Faecalibacterium prausnitzii_D
1002,0.00,s__Faecalibacterium prausnitzii_C
1002,0.01,s__Faecalibacterium prausnitzii_I
...
1003,0.60,s__Faecalibacterium prausnitzii_D
1003,0.02,s__Faecalibacterium prausnitzii_C
1003,0.39,s__Faecalibacterium prausnitzii_I
...
...
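One possible approach, as a minimal sketch rather than a definitive fix: awk does not expand shell variables inside a single-quoted program, so the value can be passed in with awk's -v option instead. The function name merge_abundance and the awk variable name id are hypothetical, chosen only for illustration. The sketch assumes the input really is tab-separated with the species name in field 6, and that the sampleIDs contain no backslashes (awk interprets escape sequences in -v values).

```shell
# Sketch: merge all *_species_abundance.txt files in a directory into one
# comma-separated long-format file. merge_abundance is a hypothetical name.
merge_abundance() {
    input_dir=$1
    out="$input_dir/merged_species_abundance.txt"
    : > "$out"    # start from an empty output file instead of appending to an old one

    for file in "$input_dir"/*_species_abundance.txt
    do
        # The output file name also matches the glob, so skip it.
        [ "$file" = "$out" ] && continue

        STEM=$(basename "$file" _species_abundance.txt)
        echo "processing sample $STEM"

        # -v id="$STEM" hands the shell value to awk as the awk variable id;
        # OFS=',' makes the commas in print produce comma-separated output.
        awk -F'\t' -v OFS=',' -v id="$STEM" '{ print id, $1, $6 }' "$file" >> "$out"
    done
}
```

Calling merge_abundance "$input_dir" would then write lines such as 1001,0.45,s__Faecalibacterium prausnitzii_D, given a file named 1001_species_abundance.txt. Setting OFS avoids building the commas by hand with string concatenation, and quoting "$file" keeps paths with spaces intact.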