
I have a large set of DNA sequences: 1.5 million of them, each around 1k characters from the set ATCG.

I am simulating error mutations, which is taking a lot of time to finish. I have identified my bottleneck, which is the function that changes the characters of the string:

def f(sequence, indexes_to_mutate):
    seq = list(sequence)
    for i in indexes_to_mutate:
        seq[i] = 'X'

    return ''.join(seq)
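
For scale, this function is called once per sequence, so the surrounding loop looks roughly like the sketch below (the variable names and the random index generation are illustrative, not my actual simulation code):

import random

# sequences: list of ~1.5 million strings, each ~1,000 characters of A/T/C/G
mutated = []
for sequence in sequences:
    # pick a handful of positions to mutate in this sequence
    indexes_to_mutate = random.sample(range(len(sequence)), k=5)
    mutated.append(f(sequence, indexes_to_mutate))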

Is there a faster way to operate on the string without having to convert it to a list and then back to a string?

ooo

1 Answer


As per this answer, the following method would be faster than converting to a list and back:

def f(sequence, indexes_to_mutate):
    for i in indexes_to_mutate:
        sequence = sequence[:i] + 'X' + sequence[i+1:]

    return sequence
liamhawkins
  • This is even less efficient than OP's method because you are creating a copy of the entire string (minus one character) for **every** edit being made, rather than just once at the start. – meowgoesthedog Feb 26 '19 at 18:41
  • I think that depends on how many indexes_to_mutate there are; if a significant chunk of the string is changing, I agree this is less efficient, but if there are only a few mutation points, I think more of the copies will be fast byte copies in C. – Sam Hartman Feb 26 '19 at 18:44
  • @SamHartman for large data sets such as 1.5 million ~1k-character sequences, computational complexity *can* have a significant impact, even with the effect of runtime overhead - and then there's also garbage collection. – meowgoesthedog Feb 26 '19 at 18:46
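
Taking the comments' point that per-edit copies of the whole string scale poorly, one alternative worth considering (not part of the answer above) is to do the edits in place on a bytearray, which avoids both the per-character list and the per-edit string copies. This is only a sketch, assuming the sequences are plain ASCII strings:

def f(sequence, indexes_to_mutate):
    # work on a mutable buffer of bytes instead of an immutable str
    seq = bytearray(sequence, 'ascii')
    for i in indexes_to_mutate:
        seq[i] = ord('X')  # single-byte in-place write
    return seq.decode('ascii')

Whether this actually beats the list round-trip for a given workload depends on how many indexes are mutated per sequence; a quick timeit comparison on representative data is the safest way to decide.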