Firstly thank you for the thousand answered questions from others I have read and used. but now I need to post my own question. Out of 4000 lines of well working code, the following 6 lines were taking weeks and now take days but I still need this to run way faster. My chunk of particularly SLOW code:
for i in PepCombDict.items():
for x in PepSeqDictFull_NoneUnique_NRP.items():
if x[0] == i[0][0] or x[0] == i[0][1]:
continue
elif(set(x[1]).issubset(set(i[1]))):
subsumablelist.append([i[0][0],i[0][1],x[0]])
PepCombDict
is a dictionary with first entry as follows:
key is a pair of values: (0, 1)
value is list:
['TVMENFVAFVDK', 'KVPQVSTPTLVEVSR', 'DTHKSEIAHR', 'YICDNQDTISSK', 'KQTALVELLK', 'FKDLGEEHFK', 'EYEATLEECCAK', 'AEFVEVTK', 'LVNELTEFAK', 'LVTDLTK', 'LGEYGFQNALIVR', 'LSQKFPK', 'HLVDEPQNLIK', 'YLYEIAR', 'SIVHPSYNSNTLNNDIMLIK', 'SAYPGQITSNMFCAGYLEGGK', 'LKSAASLNSR', 'SAASLNSR', 'SGIQVR', 'NKPGVYTK', 'LGEDNINVVEGNEQFISASK', 'SSGTSYPDVLK', 'APILSDSSCKSAYPGQITSNMFCAGYLEGGK', 'SIVHPSYNSNTLNNDIMLIKLK']
Similarly, PepSeqDictFull_NoneUnique_NRP
is also a dictionary with first entry as follows:
Key is a single value: 9
Value is: list
['VTMQNLNDR', 'QSVEADINGLR', 'QSVEADINGLRR', 'LAADDFR', 'DYSKYYK', 'SKELTTEIDNNIEQISSYK', 'KDAEAWFNEK', 'ISSSKGSLGGGFSSGGFSGGSFSR', .....
The objective of this code block is to find all cases where the list of string sequences found in PepSeqDictFull_NoneUnique_NRP
is a subset of the string sequences found in PepCombDict
; and when found, save the keys (indexes) as a list of three numbers (combination of the two keys).
If someone knows how to make this go way faster, I would greatly appreciate it. I am totally open to changing formats of any part. I learned python with this pandemic and want to stick with python. PepSeqDictFull_NoneUnique_NRP
is always smaller. PepCombDict
can range in size from 2,000,000 to 100,000,000.
if x[0] == i[0][0] or x[0] == i[0][1]: continue
is just there to prevent an entry being found to be a subset of itself.
Best,
Tom