Dear stack overflow community,
I have 100 .VCF files (a type of txt file). In the "ID" column there are different structural variant calls:
MantaINS
MantaINV
MantaDEL
MantaBND
MantaDUP
Canvas:REF
Canvas:GAIN
Canvas:LOSS
(alongside a number e.g. MantaINS:00:13:467, Canvas:Gain:594:31:23 etc)
The files look like this (but are much larger with thousands of entries per file)
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682 MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 MantaDEL:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 2658945 MantaDUP:5:22:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 1325462 MantaINV:3:000:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 5783961 CavnasREF:7:943:1453 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
7 9425916 CanvasGAIN:9:323:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
8 9425916 CanvasLOSS:2:932:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
Each file is in a separate folder and I have generated a txt file of the file paths for all 100 vcfs. This file looks like this (first 4 only):
genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz
genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz
genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz
genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz
genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz
I want to subdivide the files by type of structural variant which are found in the ID column so for each input vcf file I get 8 output files divided by ID type e.g. for Manta_INS I would like a .txt file which would have only the below line taken from the example above:
2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL END=14477381 SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
I.e. for each input vcf I would like the output to be 8 subdivided files.
(e.g. person 1.vcf -> person1_MantaINS.txt, person1_MantaDEL.txt, person1_MantaINV.txt etc)
On a single VCF file I have run:
for T in MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
bcftools view person1.vcf | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}' > ${T}.txt
done
Which works perfectly well (apart from the Canvas calls which have a colon in them). However, I want to input a list of 100 files to run the same loop over.
I have tired:
for T in MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas:REF Canvas: GAIN Canvas:LOSS
do
parallel -j6 "bcftools view {} | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}' > $basename{}.txt" :::: paths_to_files.txt
done
Which is giving me an error message: "no such file or directory" from within parallel for any of my file types.
I am working on a HPC via remote terminal.
Your help would be greatly appreciated.
Many thanks