I need speed. Comparing two very large dictionaries: key are indexes original pandas index stored as tuples, values are lists of strings

Question

Firstly thank you for the thousand answered questions from others I have read and used. but now I need to post my own question. Out of 4000 lines of well working code, the following 6 lines were taking weeks and now take days but I still need this to run way faster. My chunk of particularly SLOW code:

    for i in PepCombDict.items():
        for x in PepSeqDictFull_NoneUnique_NRP.items():
            if x[0] == i[0][0] or x[0] == i[0][1]:
                continue
            elif(set(x[1]).issubset(set(i[1]))):
                subsumablelist.append([i[0][0],i[0][1],x[0]])

PepCombDict is a dictionary with first entry as follows: key is a pair of values: (0, 1) value is list:

['TVMENFVAFVDK', 'KVPQVSTPTLVEVSR', 'DTHKSEIAHR', 'YICDNQDTISSK', 'KQTALVELLK', 'FKDLGEEHFK', 'EYEATLEECCAK', 'AEFVEVTK', 'LVNELTEFAK', 'LVTDLTK', 'LGEYGFQNALIVR', 'LSQKFPK', 'HLVDEPQNLIK', 'YLYEIAR', 'SIVHPSYNSNTLNNDIMLIK', 'SAYPGQITSNMFCAGYLEGGK', 'LKSAASLNSR', 'SAASLNSR', 'SGIQVR', 'NKPGVYTK', 'LGEDNINVVEGNEQFISASK', 'SSGTSYPDVLK', 'APILSDSSCKSAYPGQITSNMFCAGYLEGGK', 'SIVHPSYNSNTLNNDIMLIKLK']

Similarly, PepSeqDictFull_NoneUnique_NRP is also a dictionary with first entry as follows: Key is a single value: 9 Value is: list

['VTMQNLNDR', 'QSVEADINGLR', 'QSVEADINGLRR', 'LAADDFR', 'DYSKYYK', 'SKELTTEIDNNIEQISSYK', 'KDAEAWFNEK', 'ISSSKGSLGGGFSSGGFSGGSFSR', .....

The objective of this code block is to find all cases where the list of string sequences found in PepSeqDictFull_NoneUnique_NRP is a subset of the string sequences found in PepCombDict; and when found, save the keys (indexes) as a list of three numbers (combination of the two keys).

If someone knows how to make this go way faster, I would greatly appreciate it. I am totally open to changing formats of any part. I learned python with this pandemic and want to stick with python. PepSeqDictFull_NoneUnique_NRP is always smaller. PepCombDict can range in size from 2,000,000 to 100,000,000.

if x[0] == i[0][0] or x[0] == i[0][1]: continue is just there to prevent an entry being found to be a subset of itself.

Best,

Tom

One obvious improvement is to move `set(i[1])` out of the inner loop. There's no reason to keep creating it over and over in your inner loop. Instead, create it in your outer loop, then reuse it in the inner loop. — Tom Karzes, Jun 27 '21 at 05:24
This link gives a pretty good discussion of different techniques: https://stackoverflow.com/questions/16579085/how-can-i-verify-if-one-list-is-a-subset-of-another — John D, Jun 27 '21 at 05:50

enzo · Accepted Answer · 2021-06-27T05:52:07.493

combs = {
    (0, 1): ['TVMENFVAFVDK', 'KVPQVSTPTLVEVSR', 'DTHKSEIAHR', 'YICDNQDTISSK', 'KQTALVELLK', 'FKDLGEEHFK', 'EYEATLEECCAK', 'AEFVEVTK', 'LVNELTEFAK', 'LVTDLTK', 'LGEYGFQNALIVR', 'LSQKFPK', 'HLVDEPQNLIK', 'YLYEIAR', 'SIVHPSYNSNTLNNDIMLIK', 'SAYPGQITSNMFCAGYLEGGK', 'LKSAASLNSR', 'SAASLNSR', 'SGIQVR', 'NKPGVYTK', 'LGEDNINVVEGNEQFISASK', 'SSGTSYPDVLK', 'APILSDSSCKSAYPGQITSNMFCAGYLEGGK', 'SIVHPSYNSNTLNNDIMLIKLK']
}

seqs = {
    9: ['VTMQNLNDR', 'QSVEADINGLR', 'QSVEADINGLRR', 'LAADDFR', 'DYSKYYK', 'SKELTTEIDNNIEQISSYK', 'KDAEAWFNEK', 'ISSSKGSLGGGFSSGGFSGGSFSR']
}

As @TomKarzes noted, you can create the sets outside the loop:

combs = {k: set(v) for k, v in combs.items()}
seqs = {k: set(v) for k, v in seqs.items()}

You can also make a list comprehension while building subsumablelist:

subsumablelist = [
    [comb_ids[0], comb_ids[1], seq_id]
    for comb_ids, comb in combs.items()
    for seq_id, seq in seqs.items()
    if (seq_id not in comb_ids) and all(subseq in comb for subseq in seq)
]

Using timeit, the time improvement from your code (4.97s) to the proposed code (3.03s) is 39%.

Using seq.issubset(comb) instead of all(subseq in comb for subseq in seq) (2.79s) increases the time improvement to 43%.

I need speed. Comparing two very large dictionaries: key are indexes original pandas index stored as tuples, values are lists of strings

1 Answers1