Optimizing performance using a big dictionary in Python

Question

Background: I'm trying to create a simple python program that allows me to take part of a transcript by its transcriptomic coordinates and get both its sequence and its genomic coordinates. I'm not an experienced bioinformatician or programmer, more a biologist, but the way I thought about doing it would be to split each transcript to its nucleotides and store along with each nucleotide, in a tuple, both its genomic coordinates and its coordinates inside the transcript. That way I can then use python to take part of a certain transcript (say, the last 200 nucleotides) and get the sequence and the various genomic windows that construct it. The end goal is more complicated than that (the final program will receive a set of coordinates in the form of distance from the translation start site (ATG) and randomly assign each coordinate to a random transcript and output the sequence and genomic coordinates)

This is the code I wrote for this, that takes the information from a BED file containing the coordinates and sequence of each exon (along with information such as transcript length, position of start (ATG) codon, position of stop codon):

from __future__ import print_function
from collections import OrderedDict
from collections import defaultdict
import time
import sys
import os

with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        content.append(line.strip().split())
        all_transcripts.append(line.strip().split()[3])
all_transcripts = list(OrderedDict.fromkeys(all_transcripts))
genes = dict.fromkeys(all_transcripts)
n=0
for line in content:
    n+=1
    if genes[line[3]] is not None:
        seq=[]
        i=0
        for nucleotide in line[14]:
            seq.append((nucleotide,int(line[9])+i,int(line[1])+i))
            i+=1
        if line[5] == '+':
            genes[line[3]][5].extend(seq)
        elif line[5] == '-':
            genes[line[3]][5] = seq + genes[line[3]][5] 
    else:
        seq=[]
        i=0
        for nucleotide in line[14]:
            seq.append((nucleotide,int(line[9])+i,int(line[1])+i))
            i+=1
        genes[line[3]] = [line[0],line[5],line[11],line[12],line[13],seq]
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()

This is an example of how to BED file looks:

Chr Start   End Transcript_ID   Exon_Type   Strand  Ex_Start    Ex_End  Ex_Length   Ts_Start    Ts_End  Ts_Length   ATG Stop    Sequence
chr1    861120  861180  uc001abw.1      5UTR    +       0       60      80      0       60      2554    80      2126    GCAGATCCCTGCGG
chr1    861301  861321  uc001abw.1      5UTR    +       60      80      80      60      80      2554    80      2126    GGAAAAGTCTGAAG
chr1    861321  861393  uc001abw.1      CDS     +       0       72      2046    80      152     2554    80      2126    ATGTCCAAGGGGAT
chr1    865534  865716  uc001abw.1      CDS     +       72      254     2046    152     334     2554    80      2126    AACCGGGGGCGGCT
chr1    866418  866469  uc001abw.1      CDS     +       254     305     2046    334     385     2554    80      2126    AGTCCACACCCACT

I wanted to create a dictionary in which each transcript id is the key, and the values stored are the length of the transcript, the chromsome it is in, the strand, the position of the ATG, the position of the Stop codon and most importantly - a list of tuples of the sequence.

Basically, the code works. However, once the dictionary starts to get big it runs very very slowly.

So, what I would like to know is, how can I make it run faster? Currently it is getting intolerably slow in around the 60,000th line of the bed file. Perhaps there is a more efficient way to do what I'm trying to do, or just a better way to store data for this.

The BED file is custom made btw using awk from UCSC tables.

EDIT: Sharing what I learned... I now know that the bottleneck is in the creation of a large dictionary. If I alter the program to iterate by genes and create a new list everytime in a similar mechanism, with this code:

for transcript in groupby(content, lambda x: x[3]):
    a = list(transcript[1])
    b = a[0]
    gene = [b[0],b[5],b[11],b[12],b[13]]
    seq=[]
    n+=1
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
    for exon in a:
        i=0
        for nucleotide in list(exon)[14]:
            seq.append((nucleotide,int(list(exon)[9])+i,int(list(exon)[1])+i))
            i+=1
    gene.append(seq)

It runs in less than 4 minutes, while in the former version of creating a big dictionary with all of the genes at once it takes an hour to run.

This is exactly the sort of problem that caused me to break down and install MySQL. — heathobrien, Apr 27 '15 at 13:07
Good point, This is kind of the whole reason a database exists, so you dont have to store a dict of 60,000 items. — chown, Apr 27 '15 at 13:13
Isn't this better suited for CodeReview? Since the program is already working (but slowly) — f.rodrigues, Apr 27 '15 at 14:04

score 2 · Answer 1 · edited May 23 '17 at 11:50

2

One this to make your code more efficient is to add to the dictionary as you read the file.

with open("canonicals_metagene_withseq.bed") as f:
for line in f:
    all_transcripts.append(line.strip().split()[3])
    add_to_content_dict(line)

and then your add_to_content_dict() function would look like the code inside the for line in content: loop

(see here)

Also, you have to define your defaultdicts as such, I don't see where genes or another dict is defined as a defaultdict.

edited May 23 '17 at 11:50

Community

1
1

answered Apr 27 '15 at 13:07

philshem

24,761
8
61
127

Or, just iterate through it directly: `for i, line in enumerate(open("file.bed")):` and the enumerate to get ride of the `n` counter – chown Apr 27 '15 at 13:11
I probably will do it at some stage but I'm quite sure that is not the major bottleneck. Regarding defaultdict - I wanted to use it but wasn't sure how to implement it since my values are not simple lists but lists with a list of tuples inside under a constant index. – Knowname Apr 27 '15 at 13:11
It shouldn't matter as long as the top level list is defined this way: `genes = defaultdict(list)` – philshem Apr 27 '15 at 13:18

score 0 · Answer 2 · answered Apr 27 '15 at 14:01

This might be a good read, which details the practice of defining all dot-notation functions used inside of your loop outside as variables to enhance performance, because you aren't looking up the definition in every iteration of the loop. For example, instead of

 for line in f:
     transcripts.append(line.strip().split()[3])

you would have

 f_split = str.split
 f_strip = str.strip
 f_append = all_transcripts.append
 for line in f:
     f_append(f_split(f_strip(line))[3])

There are other goodies in that link about local variable access and is, again, definitely worth the read.

You may also consider using the Cython, PyInline, Pyrex, or PyPy libraries to use C code within your Python script (for the efficiency when dealing with lots and lots of iteration and/or file I/O.)

As for the data structure itself (which was your major concern), we're limited in python to how much control over a dictionary's expansion we have. Big dicts do get heavier on the memory consumption as they get bigger... but so do all data structures! You have a couple options that may make a minute difference (storing strings as bytestrings / using a translation dict for encoded integers), but you may want to consider implementing a database instead of holding all that stuff in a python dict during runtime.

Optimizing performance using a big dictionary in Python

2 Answers2