Edited to clarify input/output. I think this will be somewhat slow no matter what, but up until now I haven't really considered speed in my python scripts, and I'm trying to figure out ways to speed up operations like these.
My input is pickled dictionaries of the genome sequences. I'm currently working with two genomes, the budding yeast genome (11.5 MB on disk), and the human genome (2.8 GB on disk). These dictionaries have the form:
seq_d = { 'chr1' : 'ATCGCTCGCTGCTCGCT', 'chr2' : 'CGATCAGTCATGCATGCAT',
'chr3' : 'ACTCATCATCATCATACTGGC' }
I want to find all single-base instances of a nucleotide(s) in both strands of the genome. Where the +
strand refers to the sequence in the above dictionary, and the -
strand is the reverse complement of the sequences. My output is a nested dictionary, where the top level keys
are +
or -
, the nested keys
are chromosome names, and the values are lists of 0-indexed positions:
nts = 'T'
test_d = {'+': {'chr3': [2, 5, 8, 11, 14, 17], 'chr2': [3, 7, 10, 14, 18],
'chr1': [1, 5, 9, 12, 16]}, '-': {'chr3': [0, 4, 7, 10, 13, 15],
'chr2': [2, 5, 9, 13, 17], 'chr1': [0]}}
test_d
defines a set of positions to examine in a large Illumina sequencing dataset later in the script.
My first attempt uses enumerate
, and iteration.
import time
import numpy as np
rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
test_d = { '+' : {}, '-' : {}}
nts = 'T'
s = time.time()
for chrom in seq_d:
plus_pos, minus_pos = [], []
chrom_seq = seq_d[chrom]
for pos, nt in enumerate(chrom_seq):
if nt in nts:
plus_pos.append(pos)
if rev_comps[nt] in nts:
minus_pos.append(pos)
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
e = time.time()
print 'The serial version took {} minutes...'.format((e-s)/60)
The output for yeast:
The serial version took 0.0455190300941 minutes...
The output for human:
The serial version took 10.1694028815 minutes...
I tried using numpy arrays rather than enumerate()
and iteration:
s = time.time()
for chrom in seq_d:
chrom_seq = np.array(list(seq_d[chrom]))
nts = list(nts)
rev_nts = [rev_comps[nt] for nt in nts]
plus_pos = list(np.where(np.in1d(chrom_seq, nts) == True)[0])
minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])
test_d['+'][chrom] = plus_pos
test_d['-'][chrom] = minus_pos
e = time.time()
print 'The numpy version took {} minutes...'.format((e-s)/60)
The output for yeast:
The numpy version took 0.0309354345004 minutes...
The output for human:
The numpy version took 9.86174853643 minutes...
Why does the numpy version lose its performance advantage for the larger human genome? Is there a faster way to do this? I tried implementing a version using multiprocessing.Pool
, but that's slower than either version:
def getTestHelper(chrom_seq, nts, rev_comp):
rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
if rev_comp:
nts = [rev_comps[nt] for nt in nts]
else:
nts = list(nts)
chrom_seq = np.array(list(chrom_seq))
mask = np.in1d(chrom_seq, nts)
pos_l = list(np.where(mask == True)[0])
return pos_l
s = time.time()
pool = Pool(4)
plus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=False), seq_d.values())
minus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=True), seq_d.values())
e = time.time()
print 'The parallel version took {} minutes...'.format((e-s)/60)
I haven't run this on the human genome, but the yeast version is slower:
The parallel version took 0.652778700987 minutes...