0

I have a string and some characters of the string can have some alternative 'weights' as follows: A S [27.0, 0] D S [27.0, 0] N [1.0, -20.0, 0] P So, for the string 'ASDSNP', I've weights 27.0 and 0.0 for each S and 1.0, -20.0 and 0.0 for each N etc.

I'm generating all the possible combination of the weights for a given string and getting their sum, so for each string I'can have multiple weight sums depending on the weight combinations and I'm populating a dict with different strings and their sum of weights.

I'm generating many sub strings from a long string following some rule and generating the sum by this code, which is taking quite long time and I'm controlling the explosion of combinations keeping only a small number of combinations, max_totalcomb = 2

Can somebody suggest any method to make this run fast? Thanks in advance

import itertools as it
import time
start_time=time.time()

def generate_substr(seq=None):
    for f in range(1,len(seq)):
        for g in range(1,len(seq)):
            if g >f :
                yield seq[f:g]

def wt_combinator(anystr=None,max_totalcomb = 2):
    '''
    Returns sum of a list of combinatorial modification weights
    '''
    wt_list=[]
    
    anystr = anystr.upper()
    wt_sites=char_wt.keys()

    for va in anystr:
        if va in wt_sites:
           wt_list.append(char_wt[va])

    wt_combination=list(it.product(*wt_list))
    filtered_combination=[e for e in wt_combination if 0 <= sum([int(bool(x)) for x in e]) <= max_totalcomb ]
    sumwt_list=[sum(cmb) for cmb in filtered_combination]
    if len(sumwt_list):
        return sumwt_list
    else:
        return [0]


char_wt= {   'N': [1.0, -20.0, 0],
             'Q': [1.0, 0],
             'S': [27.0, 0],
             'T': [27.0, 0],
             'Y': [27.0, 0]}
my_string = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"

str_db = dict()
dbcounter = 0

for fseq in generate_substr(my_string):
    swt = wt_combinator(fseq)
    for w in swt:
        str_db[dbcounter]={'seq':fseq,'sumw':w}
        dbcounter +=1


end_time=time.time()
print(end_time - start_time)
print(len(str_db))
The August
  • 469
  • 2
  • 7
  • 18
  • [This answer](https://stackoverflow.com/a/54487100/8451814) demonstrates a faster way to generate your substrings. – Woodford Nov 28 '22 at 17:23
  • I think the combination generation is the bottleneck here and not the substring generation. If I run the script without that part which generates combination, the script runs quite fast – The August Nov 28 '22 at 17:34
  • Agreed. That's why it's just a comment rather than an answer :) – Woodford Nov 28 '22 at 17:35
  • The combination generation and weight calculation should be done together. Given a string of length `n`, say `s` and its weights, getting the weight for `s + c`(char) would be much faster than calculating all the weights and its combinations again for string of length `n+1` from scratch due to `it.product()` being exponential in complexity if all chars are part of `char_wt` – Jay Nov 28 '22 at 17:38
  • I'm using the Cartesian Product function from this post https://stackoverflow.com/questions/1208118/using-numpy-to-build-an-array-of-all-combinations-of-two-arrays This is very fast, still, any alternative suggestion is welcome – The August Nov 29 '22 at 15:35

0 Answers0