
You can get the data here! (2shared; use the bottom download link.)

I'm analyzing biological data with Python.

I've written code to find matching substrings within a list of lists of long strings.

The substrings are in a list, and each is 7 nucleotides long.

So the list holds all 16384 motifs (substrings) from AAAAAAA to TTTTTTT, i.e. every combination of A, C, G, T.

The code has a for loop over the list of substrings, with a loop over the list of lists of long strings nested inside.

It works OK, but because the list of lists has 12000 lines, the code runs very slowly.

In other words, moving from AAAAAAA to the next motif, AAAAAAC, takes 2 minutes.

So running all 16384 motifs through the 12000 lines at 2 minutes each would take 16384 * 2 = 32768 minutes, about 546 hours, or roughly 22 days...

I'm using scipy and numpy to get P-values.

What I want is to count the presence and absence of each substring in the list of sequences.

The list of long strings and the code are like this:

list_of_lists_long  =  [
[BGN,    -0.054,     AGGCAGCTGCAGCCACCGCGGGGCCTCAGTGGGGGTCTCTGG....]
[ABCB7,  0.109,      GTCACATAAGACATTTTCTTTTTTTGTTGTTTTGGACTACAT....]
[GPR143, -0.137,     AGGGGATGTGCTGGGGGTCCAGACCCCATATTCCTCAGACTC....]
[PLP2,   -0.535,     GCGAACTTCCCTCATTTCTCTCTGCAATCTGCAAATAACTCC....]
[VSIG4,  0.13,       AAATGCCCCATTAGGCCAGGATCTGCTGACATAATTGCCTAG....]
[CCNB3,  -0.071,     CAGCAGCCACAGGGCTAAGCATGCATGTTAACAGGATCGGGA....]
[TCEAL3, 0.189,      TGCCTTTGGCCTTCCATTCTGATTTCTCTGATGAGAATACGA....]
....] #12000 lines

Is there any faster logic that would make the code run much faster?

I need your help!

Thank you in advance.

=====================================

Is there an easier method, without implementing anything else?

I think the iteration for the pattern match is the problem...

What I'm trying to find is BOTH how many times a length-7 motif occurs across the whole list of sequences AND how many times it does NOT occur. So if a motif is present in a string (True as a bool), increment one value; if it is absent (False), increment another value.

NOT the number of motif occurrences within a single string.

Karyo

6 Answers


Excellent question. This is a classic computer science problem. Yes, there is indeed a better algorithm. Yours processes each long string 16384 times. A better way is to process each long string only once.

Rather than searching for each motif within each long string, just record which motifs appear in each long string. For example, if you were searching for length-2 motifs in the following string:

s = 'ACGTAC'

then you could run a loop over the length 2 substrings and record which ones are present in a dict:

motifAppearances = {}
for i in range(len(s)-1):
    motif = s[i:i+2]                   # grab a length=2 substring
    if motif not in motifAppearances:
        motifAppearances[motif] = 0    # initialize the count
    motifAppearances[motif] += 1       # increment the count

Now you've processed the entire string exactly once and found all the motifs present in it. In this case the resulting dict would look like:

motifAppearances = {'AC':2, 'CG':1, 'GT':1, 'TA':1}

Doing something similar for your case should cut your run time dramatically, since each long string is scanned once instead of 16384 times.
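To make this concrete for the length-7 case, here is a minimal sketch (my own function and variable names, not the OP's code) that scans each sequence once and tallies, for every possible motif, how many sequences contain it and how many don't:

```python
from collections import Counter
from itertools import product

def motifs_in(s, width):
    # distinct substrings of the given width; each counts at most once per sequence
    return {s[i:i+width] for i in range(len(s) - width + 1)}

def tally(sequences, width=7):
    present = Counter()
    for s in sequences:
        present.update(motifs_in(s, width))   # one increment per containing sequence
    n = len(sequences)
    result = {}
    for p in product('ACGT', repeat=width):   # all 4**width possible motifs
        motif = ''.join(p)
        # absence count is simply n minus the presence count
        result[motif] = (present[motif], n - present[motif])
    return result

seqs = ['ACGTAC', 'AAAA']
print(tally(seqs, width=2)['AC'])   # -> (1, 1): one sequence contains 'AC', one doesn't
```

The key point is that the absence count never needs its own search: once you know how many sequences contain a motif, the rest by definition don't.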

dg99
  • So, to find motifs of length 7 the code should be `range(len(s)-6):` and `motif = s[i:i+7]`, right? I'm trying to implement this, but the example is a bit too simple for me to see how to do something similar for my case. Anyway, thank you @dg99 – Karyo Nov 16 '13 at 17:26
  • @Karyo Yes, that sounds right. I would suggest solving the problem with just one of those long strings first. Once you have the dict of motifAppearances, you can iterate over every possible motif `m` (you already have the code to generate those), check the dict to see if `m` is present (`if m in motifAppearances ...`), and tally the scores as necessary. Once you get that working, move on to the larger problem of multiple input strings. – dg99 Nov 16 '13 at 22:20

A clean and very fast way (~15 s with the OP's data) would be to use the CountVectorizer from scikit-learn, as it uses numpy under the hood. For example:

from sklearn.feature_extraction.text import CountVectorizer

def make_chunks(s):
    width = 2
    return [s[i:i+width] for i in range(len(s)-width+1)]

l = ['ATTGCGGCTCACGAA', 'ACCTAGATACGACGG', 'CCCCTGTCCATGGTA']

vectorizer = CountVectorizer(tokenizer=make_chunks)
X = vectorizer.fit_transform(l)

Now X is a sparse matrix with all possible chunks as columns and the sequences as rows, where every value is the number of occurrences of the given chunk in each sequence:

>>> X.toarray()
# aa ac ag at ca cc cg ...
[[1  1  0  1  1  0  2  1 1 2 1 0 0 1 1 1]      # ATTGCGGCTCACGAA
 [0  3  1  1  0  1  2  1 2 0 1 0 2 0 0 0]      # ACCTAGATACGACGG
 [0  0  0  1  1  4  0  1 0 0 1 2 1 1 2 0]]     # CCCCTGTCCATGGTA

>>> (X.toarray()>0).astype(int)  # the same but counting only once per sequence
[[1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1]
 [0 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0]
 [0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0]]

>>> vectorizer.get_feature_names()     # the columns(chunks)
[u'aa', u'ac', u'ag', u'at', u'ca', u'cc', u'cg', u'ct', u'ga', u'gc', u'gg', u'gt', u'ta', u'tc', u'tg', u'tt']

Now you can sum along the columns, mask non-zero values, or do whatever manipulation you need. For example:

>>> X.sum(axis=0)
[[1 4 1 3 2 5 4 3 3 2 3 2 3 2 3 1]]

Finally, to find how many occurrences a given motif has, you must find the index of the corresponding motif/chunk and then look it up in the previous sum:

>>> vectorizer.vocabulary_.get('ag')    # 'ag' is your motif
2                                       # this means the third column

In your case you would have to divide your list into two parts (down-regulated and not) in order to include the down condition. I made a quick test with the list from DSM's answer and it takes around 3-4 seconds on my computer to run. If I use 12000 sequences of length 4000, it takes a little less than a minute.

EDIT: The whole code using OP's data can be found here.
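To illustrate the down / not-down splitting step discussed above: `down` is a NumPy boolean mask and `~down` is its negation. A small sketch with made-up numbers standing in for the real CountVectorizer output (`values`, `X`, and the counts are illustrative, not the OP's data):

```python
import numpy as np

# made-up stand-ins: `values` is the 2nd column of the data, `X` a dense
# version of the CountVectorizer output (rows = sequences, cols = chunks)
values = np.array([-0.054, 0.109, -0.137, -0.535, 0.13])
X = np.array([[2, 0, 1],
              [0, 1, 1],
              [3, 0, 0],
              [1, 1, 0],
              [0, 0, 2]])

down = values < -0.5               # boolean mask: True for down-regulated rows
presence = (X > 0).astype(int)     # count each chunk at most once per sequence

motif_down = presence[down].sum(axis=0)       # sequences containing each chunk, down group
motif_not_down = presence[~down].sum(axis=0)  # same for the remaining rows
print(motif_down, motif_not_down)             # -> [1 1 0] [2 1 3]
```

Indexing with a boolean mask selects exactly the rows where the mask is True, so the two sums split the presence counts between the down-regulated and remaining sequences.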

elyase
  • Thanks for your help @elyase, but could you adapt the code to my data, so I can follow yours more easily and clearly? – Karyo Nov 17 '13 at 04:05
  • I think the explanation is clearer using the small test sample; I added an edit with the code I used to test on your data. – elyase Nov 17 '13 at 04:13
  • I have installed scikit-learn and tested your code. It is very fast but does not return what I expected. It only returns an array without motif information. What I expect is how many AAAAAAA are in the down-regulated list of sequences and how many are not, and likewise for the up-regulated ones. So the final printout will be 'motif', 'motif_down', 'notmotif_down', 'motif_notdown', 'notmotif_notdown'. I really appreciate your help, but it is a bit hard for me to understand as I'm a newbie. – Karyo Nov 17 '13 at 04:26
  • By the way, "down-regulated" actually means a cut-off of -0.5, not just the negative numbers (not including -0.5 itself). But what do [down] and [~down] do? Are they functions distinguishing the negative and positive values in the 2nd position of my data? – Karyo Nov 17 '13 at 04:31
  • In Python you can run the code line by line and look at what each variable contains. `down` is a boolean mask; ~down is just its negation. NumPy masks are simple arrays of boolean values that select data according to a condition. To be honest I don't fully understand what final output you want (a list? a dict?). But in my code you will see that `names` is an array of motifs and `counts` is an array of the same size holding the corresponding counts, so names[0] -> counts[0], names[1] -> counts[1]. Take a look at the variables, maybe with a shorter example, and you will see it. – elyase Nov 17 '13 at 04:50
  • I know what you mean, but your code finds all occurrences of motifs inside all sequences. That is not what I need. What matters is whether a motif is in a sequence: count the number of sequences with the motif and the number of sequences without it. So the final result will be a list containing each motif, the number of sequences with it, and the number of sequences without it. – Karyo Nov 17 '13 at 05:10
  • I read what you wrote and I still don't see how what you want differs from what I did. `down_counts[0]` (==2243) is a number representing how many times the motif `down_names[0]` (=='aaaaaaa') appears in the sequences. `no_down_counts[0]` (==3291) is the number of times this motif didn't appear in the sequences. Please give an example of your expected output (should it be an Nx3 array?), for example its first lines. I will put my code in the form you want when I am at the computer again. – elyase Nov 17 '13 at 05:24

There are several odd things about your code.

  1. What you're calling "permutations" looks more like the Cartesian product, which can be computed using itertools.product.

  2. Because Python is zero-indexed, the first element of a string is at index 0, so a comparison like i[2].find(sMotif) < 1 returns True both when the motif is absent (find returns -1) and when it occurs right at the start (find returns 0), which seems a little odd.

  3. Your OddsRatio, PValue and Enrichment calculations are inside the loop, but neither the zeroing of the counts nor the print is, which means you're computing them cumulatively for each new row but not doing anything with that information.

  4. You recompute i[2].find(sMotif) multiple times in the typical case. That result isn't cached.


Assuming that I understand the numbers you're trying to compute -- and I could well be wrong, because there are several things you're doing that I don't understand -- I'd flip the logic. Instead of looping over each motif and trying to count it in each row, loop over each row and record what's there. That's roughly one scan of each row instead of 16384 separate searches per row.

For example:

import random
from itertools import product
from collections import defaultdict, Counter

N = 12000
datalength = 400
listoflists = [[str(i), random.uniform(-1, 1), 
                ''.join([random.choice('AGCT') for c in range(datalength)])]
               for i in range(N)]

def chunk(seq, width):
    for i in range(len(seq)-width+1):
        yield seq[i:i+width]

def count_motifs(datatriples, width=7):
    motif_counts_by_down = defaultdict(Counter)
    nonmotif_counts_by_down = defaultdict(Counter)
    all_motifs = set(''.join(p) for p in product('AGCT',repeat=width))
    for symbol, value, sdata in datatriples:
        down = value < -0.5

        # what did we see?
        motifs_seen = set(chunk(sdata, width))
        # what didn't we see?
        motifs_not_seen = all_motifs - motifs_seen

        # accumulate these
        motif_counts_by_down[down].update(motifs_seen)
        nonmotif_counts_by_down[down].update(motifs_not_seen)

    return motif_counts_by_down, nonmotif_counts_by_down

(I lowered datalength just to make the run faster; if the lines are 10 times longer, the code takes 10 times as long.)

This produces on my slow laptop (after inserting some linebreaks):

>>> %time mot, nomot = count_motifs(listoflists, 7)
CPU times: user 1min 50s, sys: 60 ms, total: 1min 50s
Wall time: 1min 50s

So I'd figure about 20 minutes for the full problem, which isn't bad for so little code. (We could speed up the motifs_not_seen part by doing the arithmetic instead, but that'd only get us a factor of two anyway.)

On a much smaller case, where it's easier to see the output:

>>> mot, nomot = count_motifs(listoflists, 2)
>>> mot
defaultdict(<class 'collections.Counter'>, 
{False: Counter({'CG': 61, 'TC': 58, 'AT': 55, 'GT': 54, 'CA': 53, 'GA': 53, 'AC': 52, 'CT': 51, 'CC': 50, 'AG': 49, 'TA': 48, 'GC': 47, 'GG': 45, 'TG': 45, 'AA': 43, 'TT': 40}), 
True: Counter({'CT': 27, 'GT': 26, 'TC': 24, 'GC': 23, 'TA': 23, 'AC': 22, 'AG': 21, 'TG': 21, 'CC': 19, 'CG': 19, 'CA': 19, 'GG': 18, 'TT': 17, 'GA': 17, 'AA': 16, 'AT': 16})})
>>> nomot
defaultdict(<class 'collections.Counter'>, 
{False: Counter({'TT': 31, 'AA': 28, 'GG': 26, 'TG': 26, 'GC': 24, 'TA': 23, 'AG': 22, 'CC': 21, 'CT': 20, 'AC': 19, 'GA': 18, 'CA': 18, 'GT': 17, 'AT': 16, 'TC': 13, 'CG': 10}), 
True: Counter({'AA': 13, 'AT': 13, 'GA': 12, 'TT': 12, 'GG': 11, 'CC': 10, 'CA': 10, 'CG': 10, 'AG': 8, 'TG': 8, 'AC': 7, 'GC': 6, 'TA': 6, 'TC': 5, 'GT': 3, 'CT': 2})})
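For the follow-up question of printing the results per motif, one possible sketch (my own helper, with tiny stand-in dicts so it runs on its own; the real `mot` and `nomot` come from `count_motifs(listoflists, 7)`) that flattens the two defaultdicts into one row per motif:

```python
from collections import Counter, defaultdict

# tiny stand-in results so the snippet runs on its own
mot = defaultdict(Counter, {False: Counter({'AA': 3}), True: Counter({'AA': 1})})
nomot = defaultdict(Counter, {False: Counter({'AA': 2}), True: Counter({'AA': 4})})

def rows(mot, nomot):
    # one row per motif: motif, motif_down, notmotif_down, motif_notdown, notmotif_notdown
    motifs = set(mot[True]) | set(mot[False]) | set(nomot[True]) | set(nomot[False])
    for m in sorted(motifs):
        yield m, mot[True][m], nomot[True][m], mot[False][m], nomot[False][m]

for row in rows(mot, nomot):
    print(*row)   # -> AA 1 4 3 2
```

Because Counter returns 0 for missing keys, motifs that never appeared in one of the groups print correctly without any special-casing.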
DSM
  • I appreciate your kind answer. What I'm trying to find is how many times a length-7 motif occurs across the whole list of sequences. So if a motif is present in a string (true), then increment the value. I should have mentioned it in the question. Thank you DSM for helping me understand yours. – Karyo Nov 16 '13 at 17:18
  • And that's what `count_motifs(listoflists, 7)` computes. It returns the number of times a motif was seen in a `down` case, and the number of times it wasn't. It basically returns `nMotif_Down`, `nMotif_NotDown`, `nNotMotif_Down`, and `nNotMotif_NotDown`. – DSM Nov 16 '13 at 17:21
  • I've been working with your code and am facing a problem printing out the defaultdict with width=7. I should iterate over each key and value, but it doesn't work. Could you give me any suggestion? – Karyo Nov 17 '13 at 00:01
  • I think yours is the closest to what I want, but could you make the end result print out each motif, motif_counts_by_down and nonmotif_counts_by_down? Please. – Karyo Nov 17 '13 at 02:53
  • @Karyo, would you mind sharing your file in order to benchmark the results? – elyase Nov 17 '13 at 02:57
  • @elyase, the link is here [link](http://www.2shared.com/document/j6ecquCd/Karyos_data.html) – Karyo Nov 17 '13 at 03:29
  • @Karyo, the CountVectorizer solution is really fast; I get 14 seconds with your data. – elyase Nov 17 '13 at 03:57

Basically your problem is sequence comparison. Finding motifs in sequences is a fundamental question in bioinformatics. I think you could search for existing algorithms or packages. I searched for the keywords "motif match" on Google, and this is what I found on the first page: http://biowhat.ucsd.edu/homer/motif/ http://meme.nbcr.net/meme/doc/mast.html http://www.biomedcentral.com/1471-2105/8/189 http://www.benoslab.pitt.edu/stamp/

Luyi Tian

We wrote a tool called Sylamer. It counts occurrences of words of the same given length. By default it computes hypergeometric tests in the context of ranked genes, or it can output just the counts. It can do this for all possible words of a given length, but it is also possible to specify a smaller list of words. It was written in C and heavily optimised for speed. We made the hypergeometric computation very fast by linking to the GNU Scientific Library.

micans
  • The OP is asking about Python. Is there a Python binding for your library? Otherwise this answer is in the wrong place. – Michele d'Amico Apr 14 '15 at 11:40
  • I wrote this answer 18 months ago; perhaps I meant to hint that the OP could look at the code (which uses a bit-shifting and bit-masking approach) for ideas, but left it a bit too implicit. I think it is usually not a good idea to restrict yourself a priori to a particular tool, unless it concerns homework (which is problematic in other ways). I expanded a bit on the hypergeometric aspect because that is a commonly used approach in analysing biological sequence data (for which Sylamer was especially written). – micans Apr 14 '15 at 13:14
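The bit-shifting and bit-masking idea mentioned above could look roughly like this in Python (a sketch of the general k-mer encoding technique, not Sylamer's actual code): each base takes 2 bits, so a 7-mer fits in a 14-bit integer, and sliding the window is a shift plus a mask instead of building substrings.

```python
# encode A/C/G/T in 2 bits each; a 7-mer then fits in 14 bits
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmer_codes(seq, k=7):
    mask = (1 << (2 * k)) - 1          # keep only the low 2k bits
    code = 0
    for i, base in enumerate(seq):
        code = ((code << 2) | CODE[base]) & mask   # slide the window one base
        if i >= k - 1:
            yield code                 # integer index in [0, 4**k)

print(list(kmer_codes('ACGT', k=2)))   # -> [1, 6, 11]
```

The integer codes can then index directly into a plain array of length 4**k, which is both faster and more memory-friendly than a dict of strings.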

The problem was the repeated Fisher's exact test, I guess. If I move the P-value calculation outside the for loop, the computation becomes much faster.
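A sketch of what "outside the loop" might look like: collect all four counts first, then run `scipy.stats.fisher_exact` once per motif on the finished 2x2 table (the counts below are made up for illustration):

```python
from scipy.stats import fisher_exact

# made-up counts; in the real code these come from the motif tallies
counts = {
    # motif: (motif_down, notmotif_down, motif_notdown, notmotif_notdown)
    'AAAAAAA': (10, 90, 40, 60),
}

for motif, (md, nmd, mnd, nmnd) in counts.items():
    # one test per motif, after all counting is done
    odds, p = fisher_exact([[md, nmd], [mnd, nmnd]])
    print(motif, odds, p)
```

This way the (expensive) statistical test runs 16384 times total, rather than once per motif per row.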

Karyo