For this sort of thing, you are probably better off using another data structure. A pandas DataFrame
would work well in your situation.
The simplest solution would be to use an existing library, rather than writing your own parser. vcfnp can read vcf files into a format that is easily convertible to a pandas DataFrame
. Something like this should work:
import pandas as pd
def extractAF(files_vcf):
dfs = []
for fname in sorted(files_vcf):
vars = vcfnp.variants(fname, fields=['CHROM', 'POS', 'DP4'])
snp_pos = np.char.add(np.char.add(vars.CHROM, '_'), record.POS.astype('S'))
dp4 = vars.DP4.astype('float')
AF = dp4[2:].sum(axis=0)/dp4.sum(axis=0)
dfs.append(pd.DataFrame(AF, index=snp_pos, columns=[fname]).T)
return pd.concat(dfs).fillna(0.0)
If you absolutely must use PyVCF, it will be slower, but hopefully this will at least be faster than your existing implementation, and should produce the same result as the above code:
def extractAF(files_vcf):
files_vcf = sorted(files_vcf)
dfs = []
for fname in files_vcf:
print ' * ' + fname
vcf_reader = vcf.Reader(open(fname, 'r'))
vars = ((rec.CHROM, rec.POS) + tuple(rec.INFO['DP4']) for rec in vcf_reader)
df = pd.DataFrame(vars, columns=['CHROMS', 'POS', 'ref_F', 'ref_R', 'alt_F', 'alt_R'])
df['snp_position'] = df['CHROMS'] + '_' + df['POS'].astype('S')
df_alt = df.loc[:, ('alt_F', 'alt_R')]
df_dp4 = df.loc[:, ('alt_F', 'alt_R', 'ref_F', 'ref_R')]
df[fname] = df_alt.sum(axis=1)/df_dp4.sum(axis=1)
df = df.set_index('snp_position', drop=True).loc[:, fname:fname].T
dfs.append(df)
return pd.concat(dfs).fillna(0.0)
Now lets say you wanted to read a particular snp_position
, say contained in a variable snp_pos
, that may or may not be there (from your comment), you wouldn't actually have to change anything:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
linea_di_AF = all_vcf[snp_pos]
The result will be slightly different, though. It will be a pandas Series
, which is like an array but can also be accessed like a dictionary:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
linea_di_AF = all_vcf[snp_pos]
f_di_AF = linea_di_AF[files_vcf[0]]
This allows you to access a particular file/snp_pos pair directly:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
f_di_AF = linea_di_AF[snp_pos][files_vcf[0]]
Or, better yet:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
f_di_AF = linea_di_AF.loc[files_vcf[0], snp_pos]
Or you can get all snp_pos
values for a given file:
all_vcf = extractAF(files_vcf)
fpos = linea_di_AF.loc[fname]