0

What is the fastest way to replace certain characters in a given string other than using str.translate()?

Given a sequence that only consists of letters "A", "T", "G", and "C", I want to replace each instance of "A" with "T", "T" with "A", "C" with "G", and "G" with "C". To do this, I used an ascii dictionary map = {65:84,84:65,71:67,67:71}, and do sequence.translate(map). However, in Python 3.8 this appears to be slow. I saw people mention using byte or bytearray to do this, but I just don't know how to make it work.

It looks like I first need to encode the sequence using sequence.encode('ascii', 'ignore') and then use translate() to do the translation?

Can anybody please help me?

For example,

sequence = 'ATGCGTGCGCGACTTT'
# {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
map_dict = {65:84,84:65,71:67,67:71}
# expect 'TACGCACGCGCTGAAA'
sequence.translate(map_dict)
sensationti
  • 195
  • 1
  • 9
  • 1
    Do a favor, add some example code to make the view better of the case, adding the expected result. – Digital Farmer Jun 18 '22 at 01:47
  • It's not good practice to use python inbuilt function names for variables (`map`) – Nick Jun 18 '22 at 01:52
  • does your version of python not use ```"".replace("s","r")``` –  Jun 18 '22 at 01:54
  • @MATOS Replace cannot work here, because X goes to Y and something else goes to X. – Tim Biegeleisen Jun 18 '22 at 01:57
  • How does that limit anything replace just replaces every occurrence of that letter to one of your choice –  Jun 18 '22 at 02:00
  • OH I get what you are saying. –  Jun 18 '22 at 02:00
  • @MATOS Tim is right here, if you replace every A with T at the first place, in the second round, every T will be replaced to A, so it is not intended. – sensationti Jun 18 '22 at 02:02
  • @DigitalFarmer Thanks, my code has been attached in the question. – sensationti Jun 18 '22 at 02:05
  • Hi @sensationti I decided to delete my question and point you to this question which has generated a lot of responses and debates regarding timing for replacements → https://stackoverflow.com/questions/3411771/best-way-to-replace-multiple-characters-in-a-string – Digital Farmer Jun 18 '22 at 02:30
  • Im away from my computer so i dont want write the tio to time it on mobile, but maybe something [like this](https://tio.run/##K6gsycjPM/7/PyW5RMFWoVrdUd1KPURdRwFIWAE5QIYzkOEOYrgDGc7qtVzFqYWlqXnJqUD16o4h7s7uIOzs7ugcEhKizlVsq6Skl5WfmacRDTQzOjlWIS2/SCFZITNPAaYxVpOroCgzr0SjWPP/fwA) – 0x263A Jun 18 '22 at 03:37
  • Update, [nope](https://tio.run/##bZDNioMwFIX3PsXFjXGQokyrJeBCssgLZFfKkGpkMpjExuuilHl2J/0ROswEEs6Fc85H7njBT2ffl6X3zgBqozSCNqPz@JyiqFM9TEYP6kImdZ6VbVVKIwinaxFquCZNQhORZBAeGoYgWBD8JngQLPm@273C2VuI482X05YcQvzQHqF3HlrQFtb2Y/qEKjtJ1M6ifiHDvcvI8aPTD365o/tttt/ScpdVBS2rrKxoVfyCrvkNemmnQaIia0WgTXXcCM747TLeMCFE/FbkeRSNXlskj1WQQZpTJ@m6jDQDO5uT8nWw5nma/u9@/cWfyLL8AA) – 0x263A Jun 18 '22 at 04:06

1 Answers1

0

Assumption here is the sequence is very long, then this should be O(1):

If you maintain an index which contains the position of each letter in the sequence, then you just need to update the index to do bulk replacements.

For example given seq = "AGCTTCGA"

index = {"A": {0, 7}, "G": {1, 6}, "C": {2, 5}, "T": {3, 4}}

and if I understand correctly you want to do a swap:

def swap(index, charA, charB):
    tmp = index[charB]
    index[charB] = index[charA]
    index[charA] = tmp

swap(index, "A", "T")
print(index)
# {'A': {3, 4}, 'G': {1, 6}, 'C': {2, 5}, 'T': {0, 7}}
Doug Shore
  • 771
  • 8
  • 9
  • 1
    If the sequence is very long and the index is unknown, this won't work unless you create 4 lists to store the indices, right? – sensationti Jun 18 '22 at 03:34
  • How would a complete implementation of this be O(1)? I'm pretty sure if you started from OP's input (a string) this would have to be at least O(n) to construct the dictionary then the two swaps then you'd have to reconstruct the string which would be at least another O(n)? Also you can just do `index[charB], index[charA] = index[charA], index[charB]`. Also wouldnt all that overhead would be less efficient than translate? – 0x263A Jun 18 '22 at 04:22
  • I read the problem statement as the swap being expensive, this swap is O(1), just a reference change. Yes you have to construct, but you have to construct any data structure, and this would not be expensive. I was trying to offer a different approach that probably requires more information on how this structure is being used. – Doug Shore Jun 20 '22 at 13:12