Background: I'm trying to create a simple python program that allows me to take part of a transcript by its transcriptomic coordinates and get both its sequence and its genomic coordinates. I'm not an experienced bioinformatician or programmer, more a biologist, but the way I thought about doing it would be to split each transcript to its nucleotides and store along with each nucleotide, in a tuple, both its genomic coordinates and its coordinates inside the transcript. That way I can then use python to take part of a certain transcript (say, the last 200 nucleotides) and get the sequence and the various genomic windows that construct it. The end goal is more complicated than that (the final program will receive a set of coordinates in the form of distance from the translation start site (ATG) and randomly assign each coordinate to a random transcript and output the sequence and genomic coordinates)
This is the code I wrote for this, that takes the information from a BED file containing the coordinates and sequence of each exon (along with information such as transcript length, position of start (ATG) codon, position of stop codon):
from __future__ import print_function
from collections import OrderedDict
from collections import defaultdict
import time
import sys
import os
with open("canonicals_metagene_withseq.bed") as f:
for line in f:
content.append(line.strip().split())
all_transcripts.append(line.strip().split()[3])
all_transcripts = list(OrderedDict.fromkeys(all_transcripts))
genes = dict.fromkeys(all_transcripts)
n=0
for line in content:
n+=1
if genes[line[3]] is not None:
seq=[]
i=0
for nucleotide in line[14]:
seq.append((nucleotide,int(line[9])+i,int(line[1])+i))
i+=1
if line[5] == '+':
genes[line[3]][5].extend(seq)
elif line[5] == '-':
genes[line[3]][5] = seq + genes[line[3]][5]
else:
seq=[]
i=0
for nucleotide in line[14]:
seq.append((nucleotide,int(line[9])+i,int(line[1])+i))
i+=1
genes[line[3]] = [line[0],line[5],line[11],line[12],line[13],seq]
sys.stdout.write("\r")
sys.stdout.write(str(n))
sys.stdout.flush()
This is an example of how to BED file looks:
Chr Start End Transcript_ID Exon_Type Strand Ex_Start Ex_End Ex_Length Ts_Start Ts_End Ts_Length ATG Stop Sequence
chr1 861120 861180 uc001abw.1 5UTR + 0 60 80 0 60 2554 80 2126 GCAGATCCCTGCGG
chr1 861301 861321 uc001abw.1 5UTR + 60 80 80 60 80 2554 80 2126 GGAAAAGTCTGAAG
chr1 861321 861393 uc001abw.1 CDS + 0 72 2046 80 152 2554 80 2126 ATGTCCAAGGGGAT
chr1 865534 865716 uc001abw.1 CDS + 72 254 2046 152 334 2554 80 2126 AACCGGGGGCGGCT
chr1 866418 866469 uc001abw.1 CDS + 254 305 2046 334 385 2554 80 2126 AGTCCACACCCACT
I wanted to create a dictionary in which each transcript id is the key, and the values stored are the length of the transcript, the chromsome it is in, the strand, the position of the ATG, the position of the Stop codon and most importantly - a list of tuples of the sequence.
Basically, the code works. However, once the dictionary starts to get big it runs very very slowly.
So, what I would like to know is, how can I make it run faster? Currently it is getting intolerably slow in around the 60,000th line of the bed file. Perhaps there is a more efficient way to do what I'm trying to do, or just a better way to store data for this.
The BED file is custom made btw using awk from UCSC tables.
EDIT: Sharing what I learned... I now know that the bottleneck is in the creation of a large dictionary. If I alter the program to iterate by genes and create a new list everytime in a similar mechanism, with this code:
for transcript in groupby(content, lambda x: x[3]):
a = list(transcript[1])
b = a[0]
gene = [b[0],b[5],b[11],b[12],b[13]]
seq=[]
n+=1
sys.stdout.write("\r")
sys.stdout.write(str(n))
sys.stdout.flush()
for exon in a:
i=0
for nucleotide in list(exon)[14]:
seq.append((nucleotide,int(list(exon)[9])+i,int(list(exon)[1])+i))
i+=1
gene.append(seq)
It runs in less than 4 minutes, while in the former version of creating a big dictionary with all of the genes at once it takes an hour to run.