You can get the data here (2shared, the bottom download link).
I'm analyzing biological data with Python.
I've written code to find matching substrings within a list of lists of long strings.
The substrings are stored in a list, and each one is 7 nucleotides long.
The list runs from AAAAAAA to TTTTTTT and contains all 16384 (4^7) motifs (substrings), i.e. every combination of A, C, G and T.
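For clarity, the motif list can be built roughly like this. This is a minimal sketch using only the standard library; the names `NUCLEOTIDES`, `MOTIF_LENGTH` and `motifs` are placeholders chosen for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"   # alphabet, in the order the motifs should be enumerated
MOTIF_LENGTH = 7

# product() yields tuples in lexicographic order of the input alphabet,
# so the list starts at AAAAAAA and ends at TTTTTTT.
motifs = ["".join(p) for p in product(NUCLEOTIDES, repeat=MOTIF_LENGTH)]

print(len(motifs))                        # 16384
print(motifs[0], motifs[1], motifs[-1])   # AAAAAAA AAAAAAC TTTTTTT
```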
The code has a for loop over the list of substrings, with a loop over the list of lists of long strings nested inside it.
It works, but because the list of lists has 12000 lines, it runs very slowly.
In other words, finishing AAAAAAA and moving on to the next motif, AAAAAAC, takes 2 minutes.
At 2 minutes per motif for 16384 motifs across 12000 lines, the whole run will take 16384 * 2 = 32768 minutes, which is about 546 hours, or nearly 23 days.
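The nested loop has roughly the following shape. This is a simplified sketch, not the exact code: the three rows and the gene names in it are made-up stand-ins for the real data, the function name `count_presence_slow` is a placeholder, and the p-value step is left out.

```python
# Made-up stand-in for the real 12000-row list: [gene, score, sequence]
list_of_lists_long = [
    ["GENE_A", -0.054, "AAAAAAACGT"],
    ["GENE_B", 0.109, "CCCCCCCCCC"],
    ["GENE_C", -0.137, "GAAAAAAATT"],
]
motifs = ["AAAAAAA", "AAAAAAC", "CCCCCCC"]  # the real list has 16384 entries


def count_presence_slow(motifs, rows):
    """Return {motif: (n_present, n_absent)}, scanning every row once per motif."""
    result = {}
    for motif in motifs:                      # outer loop: 16384 iterations
        present = absent = 0
        for _gene, _score, sequence in rows:  # inner loop: 12000 iterations
            if motif in sequence:
                present += 1
            else:
                absent += 1
        result[motif] = (present, absent)
    return result


counts = count_presence_slow(motifs, list_of_lists_long)
print(counts)
# {'AAAAAAA': (2, 1), 'AAAAAAC': (1, 2), 'CCCCCCC': (1, 2)}
```

With the real data this means 16384 * 12000 = 196,608,000 substring searches, each over a long string.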
I'm using scipy and numpy to get p-values.
What I want is to count the presence and absence of each substring across the list of sequences.
The list of long strings and the code are like this:
list_of_lists_long = [
    ["BGN", -0.054, "AGGCAGCTGCAGCCACCGCGGGGCCTCAGTGGGGGTCTCTGG...."],
    ["ABCB7", 0.109, "GTCACATAAGACATTTTCTTTTTTTGTTGTTTTGGACTACAT...."],
    ["GPR143", -0.137, "AGGGGATGTGCTGGGGGTCCAGACCCCATATTCCTCAGACTC...."],
    ["PLP2", -0.535, "GCGAACTTCCCTCATTTCTCTCTGCAATCTGCAAATAACTCC...."],
    ["VSIG4", 0.13, "AAATGCCCCATTAGGCCAGGATCTGCTGACATAATTGCCTAG...."],
    ["CCNB3", -0.071, "CAGCAGCCACAGGGCTAAGCATGCATGTTAACAGGATCGGGA...."],
    ["TCEAL3", 0.189, "TGCCTTTGGCCTTCCATTCTGATTTCTCTGATGAGAATACGA...."],
    # .... 12000 lines in total
]
Is there any faster logic that would make the code run much faster?
I need your help!
Thank you in advance.
=====================================
Is there an easier method that doesn't require implementing anything else?
I think the iteration for the pattern matching is the problem.
What I'm trying to find is BOTH how many sequences in the whole list contain a given length-7 motif AND how many do not. So if a motif is present in a string (True as a bool), one counter is incremented; if it is absent (False), another counter is incremented.
I am NOT counting the number of times a motif occurs within a string.
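One direction that might avoid the repeated scans is to go through each sequence only once, collect the distinct 7-letter windows it contains, and add one to the "present" count of each of them; the "absent" count is then the number of sequences minus the "present" count. The following is only a sketch under assumptions: it uses the standard library alone, the rows and gene names are made up, `count_presence_fast` is a placeholder name, and it has not been timed on the real 12000 lines.

```python
from collections import Counter

MOTIF_LENGTH = 7


def count_presence_fast(rows, k=MOTIF_LENGTH):
    """Return (present, n_rows); present[motif] is the number of rows containing motif."""
    present = Counter()
    for _gene, _score, sequence in rows:
        # Set of distinct k-letter windows in this sequence. Using a set means a
        # motif that occurs several times in one sequence is still counted once.
        distinct = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        present.update(distinct)
    return present, len(rows)


# Made-up stand-in for the real list: [gene, score, sequence]
rows = [
    ["GENE_A", -0.054, "AAAAAAACGT"],
    ["GENE_B", 0.109, "CCCCCCCCCC"],
    ["GENE_C", -0.137, "GAAAAAAATT"],
]

present, n_rows = count_presence_fast(rows)
for motif in ("AAAAAAA", "AAAAAAC", "TTTTTTT"):
    n_present = present[motif]      # a Counter returns 0 for motifs never seen
    n_absent = n_rows - n_present
    print(motif, n_present, n_absent)
# AAAAAAA 2 1
# AAAAAAC 1 2
# TTTTTTT 0 3
```

The work here grows with the total length of the sequences rather than with 16384 times that length. Windows containing characters other than A, C, G, T (or lowercase letters) simply end up as extra keys that are never looked up.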