Your sample output suggests that you only want one index per subsequence but your description implies that you'll probably need all indexes.
Here is a function that will build a dictionary with all indexes for the subsequences (of a given length) that are present:
from collections import defaultdict
def sequenceDict(seq,count):
result = defaultdict(list)
for i,subseq in enumerate(zip(*(seq[i:] for i in range(count)))):
result["".join(subseq)].append(i)
return result
r = sequenceDict('GACTCGTAGCTAGCTG',3)
print(r)
# {'GAC': [0], 'ACT': [1], 'CTC': [2], 'TCG': [3], 'CGT': [4], 'GTA': [5], 'TAG': [6, 10], 'AGC': [7, 11], 'GCT': [8, 12], 'CTA': [9], 'CTG': [13]})
If you really only want the first index of each 3-letter subsequence, the dictionary can be obtained much faster using a single dictionary comprehension:
from itertools import product
{ ss:sequence.index(ss) for p in product(*["ACGT"]*3)for ss in ["".join(p)] if ss in sequence}
I ran a performance test for a random sequence of 250 million letters and the single index dictionary can be obtained in a few microseconds. Obtaining all the indexes takes a little more than a minute (using the function above):
import time
size = 250_000_000
print("loading sequence...",size)
start = time.time()
import random
sequence = "".join(random.choice("ACGT") for _ in range(size))
print("sequence ready",time.time()-start)
start = time.time()
from itertools import product
seqDict = { ss:sequence.index(ss) for p in product(*["ACGT"]*3)for ss in ["".join(p)] if ss in sequence}
print("\n1st index",time.time()-start)
start = time.time()
r = sequenceDict(sequence,3)
print("\nall indexes",time.time()-start)
output:
loading sequence... 250000000
sequence ready 193.82172107696533
1st index 0.000141143798828125
all indexes 71.74848103523254
Given that the time to load the sequences is much larger thant the time to build the index, you may just forgo storing that index dictionary and just rebuild it each time from the source data (which you seem to still need to load anyway for your process)
You could also just store a dictionary of counts and extract the indexes on demand as you need them:
This function gets the number of occurrences for each subsequence:
from collections import Counter
def countSubSeqs(seq,size):
return Counter("".join(s) for s in zip(*(seq[i:] for i in range(size))))
It runs in roughly the same time as the sequenceDict function but produces a much smaller dictionary.
To get the indexes of a specific subsequence (including overlapping positions) you can use this:
subSeq = "ACT"
indexes = [ i for i in range(len(sequence)) if sequence[i:i+3]==subSeq ]
If you don't need all the indexes for all the subsequences right away, you could probably structure your code accordingly and only get the indexes when needed (and possibly store them in a dictionary for chaching and reuse)