
I have a tab-separated VCF file that looks like this example:

Chrom   Pos
chr1    82689
chr1    82709
chr1    93583
chr1    94111 

I would like it to look like this:

Chrom   ID              Pos
chr1    sample1_82689   82689
chr1    sample1_82709   82709
chr1    sample1_93583   93583
chr1    sample1_94111   94111

I have the sample names stored in a text file (136 of them). When the job is run as a Slurm array on the HPC, I use this code to extract the line of the ID that corresponds to the array job number:

#!/bin/bash --login
#SBATCH --array=1-136

EXOME_IDs_FILE=/home/IDs.txt
sed -n "${SLURM_ARRAY_TASK_ID}p" "$EXOME_IDs_FILE"

This means that anytime {} occurs in my script, the ID from that file is extracted and can be used. I can therefore use it to insert the ID into the new column, but I am struggling to figure out how to get the Pos value into that ID column as well.

awk 'BEGIN{ FS=OFS="\t" } {$1 = $1 FS (NR==1? "sample_variantpos_ID" : "{}") }1' file.vcf > tmp && mv tmp file.vcf

However, I do not know how to append the value of the Pos column to the ID.

tripleee

1 Answer


Looks like you want something like

exome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" /home/IDs.txt)
awk -v id="$exome" 'BEGIN{ FS=OFS="\t" }
# header on line 1; otherwise the sample ID joined to Pos with "_"
{print $1, (NR==1? "sample_variantpos_ID" : id "_" $2), $2}' file.vcf > tmp &&
mv tmp file.vcf

Just printing the result rather than forcing Awk to replace the input line is a very minor performance optimization, but I find it easier to read and understand, too.
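
To try this out away from the cluster, here is a minimal sketch that joins the sample name and the position with an underscore, as in the desired output. The temporary directory, the sample names, and the hard-coded task ID are stand-ins for illustration only; in a real array job, Slurm sets SLURM_ARRAY_TASK_ID for you.

```shell
#!/bin/bash
# Demo only: build throwaway inputs that mimic the files in the question.
workdir=$(mktemp -d)
printf 'sample1\nsample2\n' > "$workdir/IDs.txt"
printf 'Chrom\tPos\nchr1\t82689\nchr1\t82709\n' > "$workdir/file.vcf"

# Slurm normally sets this; it is hard-coded here so the demo runs anywhere.
SLURM_ARRAY_TASK_ID=1

# Pick the line of the ID list that matches the array task number.
exome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$workdir/IDs.txt")

# Insert a new second column: a header on line 1, otherwise <id>_<Pos>.
awk -v id="$exome" 'BEGIN{ FS=OFS="\t" }
{ print $1, (NR==1 ? "sample_variantpos_ID" : id "_" $2), $2 }' \
  "$workdir/file.vcf" > "$workdir/tmp" &&
mv "$workdir/tmp" "$workdir/file.vcf"

cat "$workdir/file.vcf"
```

String concatenation binds more tightly than the ?: operator in Awk, so id "_" $2 as a whole is the "else" branch and needs no extra parentheses.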

In fact, you could even refactor everything into a single Awk script. Recall that everything sed can do, Awk can do better (albeit often less succinctly).

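awk -v idx="$SLURM_ARRAY_TASK_ID" 'BEGIN{ FS=OFS="\t" }
# first file (the ID list): remember line number idx, print nothing
NR==FNR { if(NR==idx) id=$0; next }
# second file: header on line 1; otherwise the ID joined to Pos with "_"
{print $1, (FNR==1? "sample_variantpos_ID" : id "_" $2), $2}' /home/IDs.txt file.vcf > tmp &&
mv tmp file.vcf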
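
If the array index can run past the end of the ID list, you may want the job to fail rather than write an empty ID. One possible sketch of that guard follows; the function name add_id_column and the error message are made up for this example and are not part of any tool.

```shell
#!/bin/bash
# add_id_column IDX IDS_FILE VCF
# Hypothetical helper: prints the VCF with the new ID column on stdout,
# and returns non-zero if IDS_FILE has no line number IDX.
add_id_column() {
  awk -v idx="$1" 'BEGIN{ FS=OFS="\t" }
  # Same role as NR==FNR, but still correct when the ID list is empty.
  FILENAME==ARGV[1] { if (FNR==idx) id=$0; next }
  # Reaching the VCF without an ID means idx was out of range.
  FNR==1 && id=="" { print "no ID on line " idx > "/dev/stderr"; bad=1; exit }
  { print $1, (FNR==1 ? "sample_variantpos_ID" : id "_" $2), $2 }
  END { exit bad }' "$2" "$3"
}
```

It would be called as add_id_column "$SLURM_ARRAY_TASK_ID" /home/IDs.txt file.vcf > tmp && mv tmp file.vcf, so that file.vcf is only replaced when an ID was actually found.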

Tangentially, see also "Correct Bash and shell script variable capitalization".

tripleee