2

Dear stack overflow community,

I have 100 .VCF files (a type of txt file). In the "ID" column there are different structural variant calls:

MantaINS
MantaINV
MantaDEL
MantaBND
MantaDUP
Canvas:REF
Canvas:GAIN
Canvas:LOSS 

(alongside a number e.g. MantaINS:00:13:467, Canvas:Gain:594:31:23 etc)

The files look like this (but are much larger with thousands of entries per file)

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682  MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 MantaDEL:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 2658945 MantaDUP:5:22:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 1325462 MantaINV:3:000:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 5783961 CavnasREF:7:943:1453 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
7 9425916 CanvasGAIN:9:323:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
8 9425916 CanvasLOSS:2:932:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

Each file is in a separate folder and I have generated a txt file of the file paths for all 100 vcfs. This file looks like this (first 4 only):

genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz   
genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz    
genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz   
genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz
genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz

I want to subdivide the files by type of structural variant which are found in the ID column so for each input vcf file I get 8 output files divided by ID type e.g. for Manta_INS I would like a .txt file which would have only the below line taken from the example above:

2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL END=14477381 SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

I.e. for each input vcf I would like the output to be 8 subdivided files.

(e.g. person 1.vcf -> person1_MantaINS.txt, person1_MantaDEL.txt, person1_MantaINV.txt etc)

On a single VCF file I have run:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   bcftools view person1.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > ${T}.txt
done

Which works perfectly well (apart from the Canvas calls which have a colon in them). However, I want to input a list of 100 files to run the same loop over.

I have tired:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas:REF Canvas: GAIN Canvas:LOSS
    do
       parallel -j6 "bcftools view {}  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > $basename{}.txt" :::: paths_to_files.txt
    done

Which is giving me an error message: "no such file or directory" from within parallel for any of my file types.

I am working on a HPC via remote terminal.

Your help would be greatly appreciated.

Many thanks

tacrolimus
  • 500
  • 2
  • 12
  • Your requiredment is unclear, and probably why you have a vote-to-close against your Q. Please edit your Q to show one input file and then the filenames required for output. We need to see exact output required, not a verbal description ;-). Good luck. – shellter Oct 29 '19 at 16:47
  • @shellter Many thanks for the feedback. I have updated the question to try an be clearer. – tacrolimus Oct 29 '19 at 18:11
  • Hmm, still some issues with your Q. Don't say "Which is giving me an error message.", but include the actual text of the error message you are getting. As is we can't tell if your error is a shell error or something to do with `bcftools`, or `parallel`. Going back to earlier in your Q, you mention `person1`. Is this to assume there is `person2 ... personN`? Why not wrap your `for T` inside another `for P in person1 ... personN` loop? I would remove any extra attempts to solve your problem and only leave in the version that you *really* thought should work or is your best attempt.... – shellter Oct 29 '19 at 18:30
  • AND does your data really have blank lines in between? If not, can you please condense your data. If it does have blank lines, then indicate that below the sample data as a parenthetical comment). Sorry, not trying to beat up on you, just making a problem clear for others is a small art in itself, so this is feedback only. Good luck. – shellter Oct 29 '19 at 18:32
  • @shellter I've tried to make it clearer as requested. No you're right this feedback is helpful as the clearer I can be the higher the chances of success. I appreciate more seasoned members of this community are probably tired of badly asked questions etc etc. Many thanks. – tacrolimus Oct 29 '19 at 19:01
  • Very good! If you don't want to use a wrapping `for P` loop and "must" use parallel, then I recommend adding a tag for `gnu-parallel`. I'd have to do some research to understand your `parallel` cmd line and I'm busy today. OR I would write an awk script that filters like `awk -v T="$T" '/MantaINS/{print $0 > "MantaINS "T".out}' InFile` . That should be plenty fast. Good luck. – shellter Oct 29 '19 at 19:27
  • You can have all of your `MantaINS` like targets in one script and as each line triggers a match, it will write to the appropriate outFile. Experiment with one, before messing with the whole project ;-). Got to go. Good luck. – shellter Oct 29 '19 at 19:38
  • Your invocation of parallel has only a single double-quote ("), so something's not been typed in properly. Note that shell variables will be expanded inside double-quotes before parallel sees them (`$3`, `$0`, `$basename`, etc) – jhnc Oct 30 '19 at 03:20
  • @shellter thanks I will have a look at your script – tacrolimus Oct 30 '19 at 12:41
  • @jhnc sorry I have added the second quotation marks - I work in an airlock environment so have to manually type out any issues I am having! – tacrolimus Oct 30 '19 at 12:42

1 Answers1

0

You write that this works perfectly for a single VCF-file:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   bcftools view person1.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > ${T}.txt
done

then this should also work:

doit() {
    vcf="$1"
    out="$2"
    T="$3"
    bcftools view "$vcf" |
       awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > "$out"
}

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   doit person1.vcf person1_${T}.txt ${T}
done

and if that works, then this should also work:

export -f doit
parallel doit {1} {1.}_{2}.txt {2} \
:::: list_of_vcf_files \
::: MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas

If this is not what you want, please show 3 full example of the commands you want executed.

(It is also unclear to me exactly what command you want run for Canvas:GAIN, so please let that be one of the 3 examples).

Ole Tange
  • 31,768
  • 5
  • 86
  • 104
  • Dear Ole, Many thanks for taking the time to answer my question. I have tried running the script and I keep getting an error along the lines of "no such file or directory" - the error it give is the filepaths from "list_of_vcf_files" appended to the MantaINS Manta INV etc tags. For example a filepath maybe genomes/18-11-10/HB39013/LI9034.sv.vcf and now the error is saying $mypath/genomes/18-11-10/HB39013/LI9034.sv.vcf_MantaINS.txt - no such file or directory. – tacrolimus Nov 11 '19 at 22:26
  • The VCF files I want to analyse are kept in a seperate read-only folder to where I would like the output to be if thats helpful information at all? – tacrolimus Nov 11 '19 at 22:30