0

I am trying to extract positions and SNPs from a VCF file. I have written the following so far. But how can I change the name of the dictionary so that I end up with one dictionary for each input file?

i.e.: python vcf_compare.py file1.vcf file2.vcf file3.vcf

import sys

import vcf

for variants in sys.argv[1:]:
    file1 = {} 
    vcf_reader = vcf.Reader(open(variants))
    for record in vcf_reader:
        pos = record.POS
        alt = record.ALT
        ref= record.REF
        snps[pos]=ref,alt

so for argv[1] a dictionary called file1 is created. How can I make the dictionary change name to e.g. file two for the second iteration of the loop?

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Sam Lipworth
  • 107
  • 4
  • 12

2 Answers2

1

You should use collections.defaultdict and also use with open(...):

from collections import defaultdict

files = defaultdict(dict)
for filename in sys.argv[1:]:
    with open(filename) as f:
        vcf_reader = vcf.Reader(f)
        for record in vcf_reader:
            files[filename][record.POS] = record.REF, record.ALT

All these nice python tricks make the code more readable, shorter, uses less intermediate temporary variables. Also, using with open() ensures each file gets automatically closed after being read.

Also, as you can see, you can chose better variable names, and also reduce considerably the number of lines of code.

DevLounge
  • 8,313
  • 3
  • 31
  • 44
  • Or you could use a list. – juanpa.arrivillaga Mar 30 '17 at 21:40
  • Of course. The only advantage of using a defaultdict is that the OP could access its content by a dynamically generated key name instead of using an index. – DevLounge Mar 30 '17 at 21:43
  • 1
    @juanpa.arrivillaga The OP seems to want to access by name instead of needing to keep track of indices. The dictionary allows strings to be used as keys, and thus fits the indicated needs better than a list. – TurnipEntropy Mar 30 '17 at 21:44
1

Short answer: you can't. This is an incredibly frustrating fact to many early programmers. The fix: another dictionary! outside of your variants for loop, create another dictionary and use the filename as a key. Example (you can't just copy paste this, because I don't know how to use the vcf library):

import sys

import vcf

all_files = {}
for variants in sys.argv[1:]:
    #didn't see file1 used, and didn't see snps created
    #so figured file1 was snps...
    snps = {} 
    vcf_reader = vcf.Reader(open(variants))
    for record in vcf_reader:
        pos = record.POS
        alt = record.ALT
        ref= record.REF
        snps[pos]=ref,alt
    all_files[variants] = snps

I'm assuming here that variants is a filename in the form of a string. If not, replace the variants in all_files[variants] with the string you want to use as its key.

TurnipEntropy
  • 527
  • 5
  • 12