2

I have a very large sequence that I read in as a string (250,000,000 letters). The letters are G, A, C, T.

For example:

'GACTCGTAGCTAGCTG'

I would like to create some way to store the index of each 3-letter substring in my sequence, to be used in another function later.

For example:

{'GAC': 1, 'ACT': 2, 'CTC': 3, 'TCG': 4, ...}

With my current approach, my problem is that I have not found an efficient way to store the indexes of each 3-letter substring in my sequence. Once I know the indexes for each substring, I will randomly select some of them based on given probabilities and change them to another known substring.

I have tried iterating with a for loop, one 3-letter substring at a time across a sliding window, and saving the positions as dictionary values with the 3-letter substring as the key. However, this takes too long for a very large string (250,000,000 letters). Also, when I save this dictionary to a file I have been using pd.to_csv, which seems very inefficient and creates an 8 GB file.
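A simplified sketch of that sliding-window approach (not my exact code, with sequence standing in for the long string) would be something like:

from collections import defaultdict

positions = defaultdict(list)
for i in range(len(sequence) - 2):
    # slide a 3-letter window across the sequence and record its start index
    positions[sequence[i:i + 3]].append(i)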

David Chen
  • It looks like you are making a dictionary where the keys are three-letter substrings, but the values are simple numbers. What will you do when the substring appears in multiple places, like `AGC` in your example? – Mark Apr 16 '20 at 21:09
  • You can do the opposite, where it's a list: ['GAC', 'ACT', 'CTC', 'TCG', ...] so index 0 is GAC, 1 is ACT... so it's the same but reversed. If you must keep your format, you need a list of indexes as the value for each substring. – mark pedersen Apr 16 '20 at 21:19
  • https://stackoverflow.com/questions/39944594/is-there-a-really-efficient-fast-way-to-read-large-text-files-in-python maybe the problem is not in Python but rather the data is too large to begin with? I also recommend that you use the position of the substring as the key instead of the value. – Khaled Adrani Apr 16 '20 at 21:30

2 Answers

0

You can split the long string into 3-letter substrings:

string = 'GACTCGTAGCTAGCT'
substrings = [string[3*x:3*x+3] for x in range(len(string) // 3)]

substrings will be:

['GAC', 'TCG', 'TAG', 'CTA', 'GCT']

Get the indices of these in another list:

indices = list(range(len(string) // 3))

This will simply yield:

[0, 1, 2, 3, 4]

The nth element in the substring list will correspond to the nth element in the index list.
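If it helps to see that pairing explicitly, the two lists can be zipped together (this extra step is just for illustration, not part of the original answer):

mapping = dict(zip(indices, substrings))
# {0: 'GAC', 1: 'TCG', 2: 'TAG', 3: 'CTA', 4: 'GCT'}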

For how to read the file into a string variable, you might want to look at: How to read a text file into a string variable and strip newlines?
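As a minimal sketch along those lines, assuming the sequence lives in a plain text file (the filename here is made up):

# read the whole file and strip newlines so the sequence becomes one long string
with open('sequence.txt') as f:
    string = f.read().replace('\n', '')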

zabop
0

Your sample output suggests that you only want one index per subsequence, but your description implies that you'll probably need all indexes.

Here is a function that will build a dictionary with all indexes for the subsequences (of a given length) that are present:

from collections import defaultdict

def sequenceDict(seq, count):
    result = defaultdict(list)
    # zip the sequence against shifted copies of itself to slide a window of `count` letters
    for i, subseq in enumerate(zip(*(seq[i:] for i in range(count)))):
        result["".join(subseq)].append(i)  # record every start index of this subsequence
    return result

r = sequenceDict('GACTCGTAGCTAGCTG',3)
print(r)

# {'GAC': [0], 'ACT': [1], 'CTC': [2], 'TCG': [3], 'CGT': [4], 'GTA': [5], 'TAG': [6, 10], 'AGC': [7, 11], 'GCT': [8, 12], 'CTA': [9], 'CTG': [13]}

If you really only want the first index of each 3-letter subsequence, the dictionary can be obtained much faster using a single dictionary comprehension:

from itertools import product
{ss: sequence.index(ss) for p in product("ACGT", repeat=3) for ss in ["".join(p)] if ss in sequence}

I ran a performance test on a random sequence of 250 million letters: the single-index dictionary can be obtained in a fraction of a millisecond, while obtaining all the indexes takes a little more than a minute (using the function above):

import time

size = 250_000_000
print("loading sequence...",size)
start = time.time()
import random
sequence = "".join(random.choice("ACGT") for _ in range(size))
print("sequence ready",time.time()-start)


start = time.time()
from itertools import product
seqDict = {ss: sequence.index(ss) for p in product("ACGT", repeat=3) for ss in ["".join(p)] if ss in sequence}
print("\n1st index",time.time()-start)

start = time.time()
r = sequenceDict(sequence,3)
print("\nall indexes",time.time()-start)

output:

loading sequence... 250000000
sequence ready 193.82172107696533

1st index 0.000141143798828125

all indexes 71.74848103523254

Given that the time to load the sequence is much larger than the time to build the index, you may simply forgo storing the index dictionary and rebuild it each time from the source data (which you seem to need to load anyway for your process).

You could also just store a dictionary of counts and extract the indexes on demand as you need them:

This function gets the number of occurrences for each subsequence:

from collections import Counter

def countSubSeqs(seq, size):
    # count every overlapping subsequence of the given size
    return Counter("".join(s) for s in zip(*(seq[i:] for i in range(size))))

It runs in roughly the same time as the sequenceDict function but produces a much smaller dictionary.
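For example, on the sample sequence from the question (this usage example is mine, not part of the timing test above):

counts = countSubSeqs('GACTCGTAGCTAGCTG', 3)
print(counts)
# Counter({'TAG': 2, 'AGC': 2, 'GCT': 2, 'GAC': 1, 'ACT': 1, 'CTC': 1,
#          'TCG': 1, 'CGT': 1, 'GTA': 1, 'CTA': 1, 'CTG': 1})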

To get the indexes of a specific subsequence (including overlapping positions) you can use this:

subSeq  = "ACT"
indexes = [ i for i in range(len(sequence)) if sequence[i:i+3]==subSeq ] 

If you don't need all the indexes for all the subsequences right away, you could structure your code accordingly and only fetch the indexes when needed (possibly storing them in a dictionary for caching and reuse, as sketched below).
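As a rough sketch of that caching idea (the getIndexes helper and the cache layout are my assumptions, not part of the answer above):

indexCache = {}

def getIndexes(sequence, subSeq):
    # compute the overlapping match positions once, then serve later requests from the cache
    if subSeq not in indexCache:
        indexCache[subSeq] = [i for i in range(len(sequence) - len(subSeq) + 1)
                              if sequence[i:i + len(subSeq)] == subSeq]
    return indexCache[subSeq]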

Alain T.