
Edited to clarify input/output. I expect this to be somewhat slow no matter what, but until now I haven't really considered speed in my Python scripts, and I'm trying to figure out ways to speed up operations like this.

My input is pickled dictionaries of the genome sequences. I'm currently working with two genomes, the budding yeast genome (11.5 MB on disk), and the human genome (2.8 GB on disk). These dictionaries have the form:

seq_d = { 'chr1' : 'ATCGCTCGCTGCTCGCT', 'chr2' : 'CGATCAGTCATGCATGCAT', 
        'chr3' : 'ACTCATCATCATCATACTGGC' }

I want to find all single-base instances of a nucleotide (or nucleotides) in both strands of the genome, where the + strand is the sequence in the dictionary above and the - strand is its reverse complement. My output is a nested dictionary, where the top-level keys are + or -, the nested keys are chromosome names, and the values are lists of 0-indexed positions:

nts = 'T'
test_d = {'+': {'chr3': [2, 5, 8, 11, 14, 17], 'chr2': [3, 7, 10, 14, 18], 
           'chr1': [1, 5, 9, 12, 16]}, '-': {'chr3': [0, 4, 7, 10, 13, 15], 
           'chr2': [2, 5, 9, 13, 17], 'chr1': [0]}}

test_d defines a set of positions to examine in a large Illumina sequencing dataset later in the script.

My first attempt uses enumerate, and iteration.

import time
import numpy as np

rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
test_d = { '+' : {}, '-' : {}}
nts = 'T'

s = time.time()
for chrom in seq_d:
    plus_pos, minus_pos = [], []
    chrom_seq = seq_d[chrom]
    for pos, nt in enumerate(chrom_seq):
        if nt in nts:
            plus_pos.append(pos)
        if rev_comps[nt] in nts:
            minus_pos.append(pos)

    test_d['+'][chrom] = plus_pos
    test_d['-'][chrom] = minus_pos
e = time.time()
print 'The serial version took {} minutes...'.format((e-s)/60)

The output for yeast:

The serial version took 0.0455190300941 minutes...

The output for human:

The serial version took 10.1694028815 minutes...

I tried using numpy arrays rather than enumerate() and iteration:

s = time.time()
for chrom in seq_d:
    chrom_seq = np.array(list(seq_d[chrom]))
    nts = list(nts)
    rev_nts = [rev_comps[nt] for nt in nts]
    plus_pos = list(np.where(np.in1d(chrom_seq, nts) == True)[0])
    minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])

    test_d['+'][chrom] = plus_pos
    test_d['-'][chrom] = minus_pos
e = time.time()
print 'The numpy version took {} minutes...'.format((e-s)/60)

The output for yeast:

The numpy version took 0.0309354345004 minutes...

The output for human:

The numpy version took 9.86174853643 minutes...

Why does the numpy version lose its performance advantage for the larger human genome? Is there a faster way to do this? I tried implementing a version using multiprocessing.Pool, but that's slower than either version:

import functools
from multiprocessing import Pool

def getTestHelper(chrom_seq, nts, rev_comp):
    rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
    if rev_comp:
        nts = [rev_comps[nt] for nt in nts]
    else:
        nts = list(nts)
    chrom_seq = np.array(list(chrom_seq))
    mask = np.in1d(chrom_seq, nts)
    pos_l = list(np.where(mask == True)[0])
    return pos_l

s = time.time()
pool = Pool(4)
plus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=False), seq_d.values())
minus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=True), seq_d.values())
e = time.time()
print 'The parallel version took {} minutes...'.format((e-s)/60)

I haven't run this on the human genome, but the yeast version is slower:

The parallel version took 0.652778700987 minutes...
HikerT
  • Is memory use a factor? – J Earls Sep 03 '16 at 20:22
  • Can you please provide an actual description of the task you're doing? As in, what is the input and what is the output supposed to be? – BrenBarn Sep 03 '16 at 20:25
  • Looks like iterating over `seq_d` is probably the issue. Especially as you convert it to a numpy array in the for loop. – Andrew Sep 03 '16 at 20:27
  • Your problem looks interesting and worth giving a shot. I think you'd receive more help if you provided the required input and the expected output — in other words, an [mcve](http://stackoverflow.com/help/mcve). – BPL Sep 03 '16 at 20:47
  • I've edited to try to clarify the input and desired output. Memory use shouldn't be a factor. I'm currently running these test cases on my laptop, but I'll be running the final code on a server with plenty of memory. – HikerT Sep 03 '16 at 22:55

3 Answers


Use built-ins

Instead of manually iterating through your long string, try str.find or str.index. Both take optional start (and end) arguments, so you don't need to slice the string yourself.

This also ditches the enumerate call, although that shouldn't be costly anyway.

Also, you could use a set to store the indices instead of a list - additions could be faster.

You would have to do it twice, though, to find both your nucleotide and its complement. Of course, look up the complement outside of the loop.
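A minimal sketch of that find-loop (the helper name is mine): str.find returns -1 when there are no more hits, so advancing the start offset past each match collects every position without slicing.

```python
def find_all(seq, target):
    """Collect every index of target in seq using str.find's start offset."""
    positions = []
    start = 0
    while True:
        pos = seq.find(target, start)
        if pos == -1:
            return positions
        positions.append(pos)
        start = pos + 1  # resume the search just past the last hit
```

For example, `find_all('ATCGCTCGCTGCTCGCT', 'T')` gives `[1, 5, 9, 12, 16]`, matching the '+' strand positions for chr1 in the question.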

Try regular expressions

You could also try regular expressions to do the same thing (if you are going to try this, try both 2 regex (for "T" and "A") and one for "T|A").
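For the one-regex variant, a sketch using re.finditer (each match object's .start() is the 0-indexed position; routing 'T' hits to the + strand and 'A' hits to the - strand mirrors the question's output):

```python
import re

seq = 'ATCGCTCGCTGCTCGCT'  # chr1 from the question
plus_pos, minus_pos = [], []
for m in re.finditer('T|A', seq):
    (plus_pos if m.group() == 'T' else minus_pos).append(m.start())
# plus_pos -> [1, 5, 9, 12, 16], minus_pos -> [0]
```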

Also, instead of doing

for chrom in seq_d:

You could do

for chromosome_name, chromosome in seq_d.items():

Which has little to do with performance, but makes the code more readable.

Daerdemandt
  • `str.find` will only find the first instance though? Would it be better to do `len(chrom_seq.split(search_string))` to find the number of times it occurs? Or `str.count` for the number, I guess – Andrew Sep 03 '16 at 20:53
  • 1
    I doubt sets will improve performance in this case given [append/set worst cases](https://wiki.python.org/moin/TimeComplexity). – Peter Brittain Sep 03 '16 at 21:14
  • @Andrew find occurence, store its index, look for next occurence in remaining string. – Daerdemandt Sep 03 '16 at 21:18
  • If the search element occurs a lot, that could still be a slow iteration I guess. – Andrew Sep 03 '16 at 21:19
  • @PeterBrittain Yeah, sets have to grow too... I suppose it would depend on how often that process is triggered, i.e. on actual data, so `timeit` will show. Maybe arrays would actually be faster. – Daerdemandt Sep 03 '16 at 21:25
  • @Andrew Sure, doing faster than explicitly iterating through all matches would be hard, but to get as close to that limit as possible shifting work to built-ins would be nice. – Daerdemandt Sep 03 '16 at 21:32
  • the human genome is 3x10^9 bases, with a GC content of ~42%. So I'm expecting ~.87x 10^9 positions on the '+' strand, and the same on the '-' strand. – HikerT Sep 03 '16 at 23:00

To be honest, I think you've been doing the right things.

There are a few more tweaks you can make to your code, though. In general, when performance is key, only do the bare minimum in your innermost loops. Looking through your code, there are still some quick optimizations left on this front:

  1. Use if...elif instead of if...if.
  2. Don't use lists where you don't have to - e.g. just a single string is sufficient for nts and the reverse.
  3. Don't evaluate the same result multiple times - e.g. the reverse lookup.

I'm guessing that your problem with multi-processing is down to the serialization of these very large strings, offsetting any performance gain you might have from running in parallel. However, there may be another way to do this - see Parallelizing a Numpy vector operation. I can't verify as I am having difficulty installing numexpr.
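As an illustration of that serialization cost, here is a sketch (my own, not benchmarked here; the function and parameter names are hypothetical) that ships each worker only a slice of the chromosome and corrects the returned indices by the slice offset, so each pickled task is smaller:

```python
from multiprocessing import Pool

import numpy as np

def find_in_chunk(args):
    """Search one slice of a chromosome; return absolute 0-indexed positions."""
    offset, chunk, target = args
    # View the slice as a byte array without a per-character Python list
    arr = np.frombuffer(chunk.encode('ascii'), dtype='S1')
    return (np.where(arr == target.encode('ascii'))[0] + offset).tolist()

def parallel_find(seq, target, processes=4):
    """Split seq into one chunk per worker and merge the offset-corrected hits."""
    step = max(1, (len(seq) + processes - 1) // processes)
    jobs = [(i, seq[i:i + step], target) for i in range(0, len(seq), step)]
    with Pool(processes) as pool:
        hits = pool.map(find_in_chunk, jobs)
    return [p for chunk_hits in hits for p in chunk_hits]
```

The copying still happens, of course, so whether this beats the serial version depends on chunk size versus pickling overhead.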

Putting them together and trying out some of the other suggestions in this thread, I get the following results:

$ python test.py
Original serial took 0.08330821593602498 minutes...
Using sets took 0.09072601397832235 minutes...
Using built-ins took 0.061421032746632895 minutes...
Using regex took 0.11649663050969442 minutes...
Optimized serial took 0.05909080108006795 minutes...
Original numpy took 0.04050511916478475 minutes...
Optimized numpy took 0.03438538312911987 minutes...

My code is as follows.

import time
import numpy as np
from random import choice
import re

# Create a single large chromosome for the test...
seq = "".join(choice("ATGCN") for _ in range(10000000))
seq_d = {"Chromosome1": seq}


rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
nts = 'T'

# Original serial implementation
def serial():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt in nts:
                plus_pos.append(pos)
            if rev_comps[nt] in nts:
                minus_pos.append(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Optimized for single character tests
def serial2():
    test_d = {'+': {}, '-': {}}
    rev_nts = rev_comps[nts]
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt == nts:
                plus_pos.append(pos)
            elif nt == rev_nts:
                minus_pos.append(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use sets instead of lists
def set_style():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = set(), set()
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt in nts:
                plus_pos.add(pos)
            if rev_comps[nt] in nts:
                minus_pos.add(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use regex to find either nucleotide...
def regex_it():
    test_d = { '+' : {}, '-' : {}}
    search = re.compile("(T|A)")
    for chrom in seq_d:
        pos = 0
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for sub_seq in search.split(chrom_seq):
            if len(sub_seq) == 0:
                continue
            if sub_seq[0] == 'T':
                plus_pos.append(pos)
            elif sub_seq[0] == 'A':
                minus_pos.append(pos)
            pos += len(sub_seq)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use str.find instead of iteration
def use_builtins():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        start = 0
        while True:
            pos = chrom_seq.find("T", start)
            if pos == -1:
                break
            plus_pos.append(pos)
            start = pos + 1

        start = 0
        while True:
            pos = chrom_seq.find("A", start)
            if pos == -1:
                break
            minus_pos.append(pos)
            start = pos + 1

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Original numpy implementation
def numpy1():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        chrom_seq = np.array(list(seq_d[chrom]))
        for_nts = list(nts)
        rev_nts = [rev_comps[nt] for nt in nts]
        plus_pos = list(np.where(np.in1d(chrom_seq, for_nts) == True)[0])
        minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Optimized for single character look-ups
def numpy2():
    test_d = {'+': {}, '-': {}}
    rev_nts = rev_comps[nts]
    for chrom in seq_d:
        chrom_seq = np.array(list(seq_d[chrom]))
        plus_pos = np.where(chrom_seq == nts)[0]
        minus_pos = np.where(chrom_seq == rev_nts)[0]

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

for fn, name in [
        (serial, "Original serial"),
        (set_style, "Using sets"),
        (use_builtins, "Using built-ins"),
        (regex_it, "Using regex"),
        (serial2, "Optimized serial"),
        (numpy1, "Original numpy"),
        (numpy2, "Optimized numpy")]:
    s = time.time()
    fn()
    e = time.time()
    print('{} took {} minutes...'.format(name, (e-s)/60))
Peter Brittain
  • Thanks for these suggestions, and analysis. I wasn't clear in the original question, but I want it to work for both single nucleotides, and multiple nucleotides, so I need to keep the if-if. `rev_nts = rev_comps[nts]` really helps the serial implementation a lot. I'll try to keep as many steps outside of the loops as possible. I think I'll keep the optimized serial version, as at scale of the human genome the slight advantage of the numpy implementation goes away. – HikerT Sep 05 '16 at 13:27
  • I think I'm spending too much time on this part, when the downstream computations will probably be the slowest, especially since I'm locked in to a legacy input data representation. – HikerT Sep 05 '16 at 13:44
  • Fair enough. Note, though, that there has been much study of and so [many algorithms](https://en.wikipedia.org/wiki/String_searching_algorithm) for efficient searching of sub-strings in a string. However, none of them will help for a single character sub-string. – Peter Brittain Sep 05 '16 at 14:20
  • @HikerT > *I want it to work for both single nucleotides, and multiple nucleotides* Do you mean sets or sequences? Because with sequences things get a bit tricky. If you are insistent on not using built-ins, you could at least use algos from [`there`](https://hg.python.org/releasing/3.5.2/file/tip/Objects/stringlib/fastsearch.h) ([explanation](http://effbot.org/zone/stringlib.htm)). – Daerdemandt Sep 06 '16 at 09:34
  • Sorry, unclear. I'll never want to find anything greater than a single nucleotide, but one of the user supplied parameters is a string of the individual nucleotides to consider. `AT` would mean that the user wants to find all instances of `A` or `T` in the genome. In this case the `if-elif` statements wouldn't work. – HikerT Sep 06 '16 at 14:45

This finds the longest chromosome string and creates an empty array with one row per chromosome and as many columns as the longest sequence in the dictionary. It then puts each chromosome into its own row, so you can call np.where on the whole array at once:

import numpy as np

longest_chrom = max(len(x) for x in seq_d.values())
genome_data = np.zeros((len(seq_d), longest_chrom), dtype='U1')
for index, entry in enumerate(seq_d):
    gen_string = list(seq_d[entry])
    genome_data[index, :len(gen_string)] = gen_string
nuc_search = "T"
nuc_locs = np.where(genome_data == nuc_search)

Using this method, the lookup is slightly different for strings of more than one nucleotide, but I'm sure it's achievable in only a couple more steps.
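For what it's worth, the multi-nucleotide case could be a membership mask over the same 2D array — a sketch on toy data using np.isin (on older numpy, np.in1d plus a reshape does the same thing):

```python
import numpy as np

# Toy two-chromosome dictionary, padded into a 2D array as above
seq_d = {'chr1': 'ATCG', 'chr2': 'GGTA'}
longest_chrom = max(len(x) for x in seq_d.values())
genome_data = np.zeros((len(seq_d), longest_chrom), dtype='U1')
for index, entry in enumerate(seq_d):
    genome_data[index, :len(seq_d[entry])] = list(seq_d[entry])

nuc_search = 'AT'  # find every A or T
mask = np.isin(genome_data, list(nuc_search))  # elementwise membership test
rows, cols = np.where(mask)  # row = chromosome index, col = position
```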

Andrew
  • 1,072
  • 1
  • 7
  • 15