
I have a Python script that processes several files of a few gigabytes each. With the code shown below, I store some data in a list for each SNP position, and those lists are kept in a dictionary snp_dict. The RAM consumption is huge. Looking at my code, could you suggest some ways to reduce RAM consumption, if any?

import vcf  # PyVCF

def extractAF(files_vcf):
    z=0
    snp_dict=dict()
    for infile_name in sorted(files_vcf):
        print '      * ' + infile_name
        ###single files
        vcf_reader = vcf.Reader(open(infile_name, 'r'))
        for record in vcf_reader:
            snp_position='_'.join([record.CHROM, str(record.POS)])
            ref_F = float(record.INFO['DP4'][0])
            ref_R = float(record.INFO['DP4'][1])
            alt_F = float(record.INFO['DP4'][2])
            alt_R = float(record.INFO['DP4'][3])
            AF = (alt_F+alt_R)/(alt_F+alt_R+ref_F+ref_R)
            if not snp_position in snp_dict:
                snp_dict[snp_position]=list((0) for _ in range(len(files_vcf)))
            snp_dict[snp_position][z] = round(AF, 3) #record.INFO['DP4']
        z+=1
    return snp_dict
user2979409

2 Answers


For this sort of thing, you are probably better off using another data structure. A pandas DataFrame would work well in your situation.

The simplest solution would be to use a library that reads the whole file in bulk rather than looping over records one at a time. vcfnp can read VCF files into NumPy arrays, which are easily converted to a pandas DataFrame. Something like this should work:

import numpy as np
import pandas as pd
import vcfnp

def extractAF(files_vcf):
    dfs = []
    for fname in sorted(files_vcf):
        # Load CHROM, POS and DP4 for the whole file into a NumPy structured array.
        variants = vcfnp.variants(fname, fields=['CHROM', 'POS', 'DP4'])
        snp_pos = np.char.add(np.char.add(variants['CHROM'], '_'), variants['POS'].astype('S'))
        dp4 = variants['DP4'].astype('float')
        # DP4 columns are ref_F, ref_R, alt_F, alt_R, so AF = (alt_F + alt_R) / total.
        AF = dp4[:, 2:].sum(axis=1) / dp4.sum(axis=1)
        # One row per file, one column per SNP position.
        dfs.append(pd.DataFrame(AF, index=snp_pos, columns=[fname]).T)
    return pd.concat(dfs).fillna(0.0)
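
If the concatenated table is still larger than you would like, note that allele frequencies do not need double precision, so downcasting to 32-bit floats roughly halves its footprint. A small sketch, assuming the extractAF above has already been defined (the float32 choice is my assumption, not something your data requires):

all_vcf = extractAF(files_vcf)
# Downcasting makes a temporary copy, but the resulting table is about half the size.
all_vcf = all_vcf.astype('float32')
print all_vcf.memory_usage().sum()  # rough size of the table in bytes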

If you absolutely must use PyVCF, it will be slower than vcfnp, but the version below should still be faster than your existing implementation, and it produces the same result as the code above:

import pandas as pd
import vcf  # PyVCF

def extractAF(files_vcf):
    files_vcf = sorted(files_vcf)
    dfs = []
    for fname in files_vcf:
        print '      * ' + fname
        vcf_reader = vcf.Reader(open(fname, 'r'))
        # One row per record: (CHROM, POS, ref_F, ref_R, alt_F, alt_R).
        rows = [(rec.CHROM, rec.POS) + tuple(rec.INFO['DP4']) for rec in vcf_reader]
        df = pd.DataFrame(rows, columns=['CHROM', 'POS', 'ref_F', 'ref_R', 'alt_F', 'alt_R'])
        df['snp_position'] = df['CHROM'] + '_' + df['POS'].astype(str)
        # Cast to float so the division below is floating-point.
        df_alt = df.loc[:, ('alt_F', 'alt_R')].astype(float)
        df_dp4 = df.loc[:, ('alt_F', 'alt_R', 'ref_F', 'ref_R')].astype(float)
        df[fname] = df_alt.sum(axis=1) / df_dp4.sum(axis=1)
        # Keep only this file's AF column, indexed by SNP position, as a single row.
        df = df.set_index('snp_position', drop=True).loc[:, [fname]].T
        dfs.append(df)
    return pd.concat(dfs).fillna(0.0)
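
To see why the final pd.concat(dfs).fillna(0.0) plays the same role as the zero-prefilled lists in your original code, here is a toy example with made-up file names and positions:

import pandas as pd

# One row per file, one column per SNP position seen in that file.
df_a = pd.DataFrame([[0.25, 0.80]], index=['sample_a.vcf'], columns=['chr1_100', 'chr1_200'])
df_b = pd.DataFrame([[0.50]], index=['sample_b.vcf'], columns=['chr1_200'])

combined = pd.concat([df_a, df_b]).fillna(0.0)
# combined has both columns; sample_b.vcf gets 0.0 for chr1_100,
# just like the zeros you pre-filled in snp_dict.
print combined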

Now let's say you want to read a particular snp_position, stored in a variable snp_pos, that may or may not be present (from your comment). You wouldn't actually have to change anything:

all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
     linea_di_AF = all_vcf[snp_pos]

The result will be slightly different, though. It will be a pandas Series, which is like an array but can also be accessed like a dictionary:

all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
     linea_di_AF = all_vcf[snp_pos]
     f_di_AF = linea_di_AF[files_vcf[0]]

This allows you to access a particular file/snp_pos pair directly:

all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
     f_di_AF = all_vcf[snp_pos][files_vcf[0]]

Or, better yet, do the lookup in a single step with .loc:

all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
     f_di_AF = all_vcf.loc[files_vcf[0], snp_pos]

Or you can get all snp_pos values for a given file:

all_vcf = extractAF(files_vcf)
fpos = all_vcf.loc[fname]
TheBlackCat
  • `if not snp_position in snp_dict: snp_dict[snp_position]=list((0) for _ in range(len(files_vcf)))` is not arbitrary... I need 0s or 0.000s (float) – user2979409 Apr 09 '15 at 16:16
  • Fixed. The last line takes care of that for you. – TheBlackCat Apr 09 '15 at 16:23
  • how could I change ` if snp_pos in all_vcf: linea_di_AF=all_vcf[snp_pos]` in order to read the `DataFrame`? – user2979409 Apr 09 '15 at 16:28
  • I have added an explanation on how to do this – TheBlackCat Apr 09 '15 at 16:46
  • Is `vcf_reader` a sequence (you can loop over it multiple times) or is it an iterator (you can only loop over it once)? And is, for example, `record.INFO['DP4'][0]` some sort of numeric type or a string? These can be used to improve performance. – TheBlackCat Apr 10 '15 at 10:39
  • I have updated the code again. I included an alternative using a different python package, as well as a faster version of my original implementation. – TheBlackCat Apr 10 '15 at 13:04

I finally adopted the following implementation with MySQL:

import vcf  # PyVCF

def extractAF(files_vcf, db1):
    # db1 is assumed to be an open MySQL connection created elsewhere.
    z = 0
    snp_dict = []  # positions already inserted into the snps table
    for infile_name in sorted(files_vcf):
        print infile_name
        ###single files
        vcf_reader = vcf.Reader(open(infile_name, 'r'))
        for record in vcf_reader:
            snp_position = '_'.join([record.CHROM, str(record.POS)])
            ref_F = float(record.INFO['DP4'][0])
            ref_R = float(record.INFO['DP4'][1])
            alt_F = float(record.INFO['DP4'][2])
            alt_R = float(record.INFO['DP4'][3])
            AF = (alt_F+alt_R)/(alt_F+alt_R+ref_F+ref_R)
            if snp_position not in snp_dict:
                # Insert a new row for this position, pre-filled with zeros for every file.
                sql_insert_table = "INSERT INTO snps VALUES ('" + snp_position + "'," + ",".join('0' for _ in range(len(files_vcf))) + ")"
                cursor = db1.cursor()
                cursor.execute(sql_insert_table)
                db1.commit()
                snp_dict.append(snp_position)
            sql_update = "UPDATE snps SET " + str(z) + "g=" + str(AF) + " WHERE snp_pos='" + snp_position + "'"
            cursor = db1.cursor()
            cursor.execute(sql_update)
            db1.commit()
        z += 1
    return snp_dict
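
If the per-record commits turn out to be slow, one refinement is to collect the (snp_position, AF) pairs for each file and send them with a parameterized executemany, committing once per file instead of once per record. A rough sketch of just the update step, assuming the same open connection db1 and the same snps table as above (the helper name update_af is hypothetical):

def update_af(db1, z, records):
    # records: list of (snp_position, AF) pairs collected for one input file.
    cursor = db1.cursor()
    # The column name cannot be a query parameter, so it is still built by hand;
    # the values are passed as parameters instead of being pasted into the string.
    sql_update = "UPDATE snps SET " + str(z) + "g=%s WHERE snp_pos=%s"
    cursor.executemany(sql_update, [(AF, snp_position) for snp_position, AF in records])
    db1.commit()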
user2979409