
I have a large set of DNA sequences: 1.5 million of them, each around 1k characters from the set ATCG.

I am simulating error mutations, which is taking a lot of time to finish. I have identified my bottleneck, which is the function that changes the characters of the string:

def f(sequence, indexes_to_mutate):
    seq = list(sequence)
    for i in indexes_to_mutate:
        seq[i] = 'X'

    return ''.join(seq)
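
For scale, this function is called once per sequence, so the surrounding loop looks roughly like the sketch below (the variable names and the random index generation are illustrative, not my actual simulation code):

import random

# sequences: list of ~1.5 million strings, each ~1,000 characters of A/T/C/G
mutated = []
for sequence in sequences:
    # pick a handful of positions to mutate in this sequence
    indexes_to_mutate = random.sample(range(len(sequence)), k=5)
    mutated.append(f(sequence, indexes_to_mutate))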

Is there a faster way to operate on the string without having to convert it to a list and then back to a string?

ooo

1 Answer


As per this answer, the following method would be faster than converting to a list and back:

def f(sequence, indexes_to_mutate):
    for i in indexes_to_mutate:
        sequence = sequence[:i] + 'X' + sequence[i+1:]

    return sequence
liamhawkins
  • This is even less efficient than OP's method because you are creating a copy of the entire string (minus one character) for **every** edit being made, rather than just once at the start. – meowgoesthedog Feb 26 '19 at 18:41
  • I think that depends on how many indexes_to_mutate there are; if a significant chunk of the string is changing, I agree this is less efficient, but if there are only a few mutation points, I think more of the copies will be fast byte copies in C. – Sam Hartman Feb 26 '19 at 18:44
  • @SamHartman for large data sets such as 1.5 million ~1k-character sequences, computational complexity *can* have a significant impact, even with the effect of runtime overhead - and then there's also garbage collection. – meowgoesthedog Feb 26 '19 at 18:46
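
Taking the comments' point that per-edit copies of the whole string scale poorly, one alternative worth considering (not part of the answer above) is to do the edits in place on a bytearray, which avoids both the per-character list and the per-edit string copies. This is only a sketch, assuming the sequences are plain ASCII strings:

def f(sequence, indexes_to_mutate):
    # work on a mutable buffer of bytes instead of an immutable str
    seq = bytearray(sequence, 'ascii')
    for i in indexes_to_mutate:
        seq[i] = ord('X')  # single-byte in-place write
    return seq.decode('ascii')

Whether this actually beats the list round-trip for a given workload depends on how many indexes are mutated per sequence; a quick timeit comparison on representative data is the safest way to decide.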