
Let's say I have the following extremely large string in Python 3.x, several GB in size and over 10 billion characters in length:

string1 = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY.....YY"

Given its length, this already takes several GB of RAM just to hold.

I would like to write a function that replaces every X with A, Y with B, and Z with C. My goal is to make this as quick as possible, while also keeping memory usage reasonable (e.g. there may be RAM trade-offs I'm not sure about).

The most obvious solution for me is to chain str.replace() calls:

def replace_characters(input_string):
    new_string = input_string.replace("X", "A").replace("Y", "B").replace("Z", "C")
    return new_string

foo = replace_characters(string1)
print(foo)

which outputs

'ABCBACCABCCABCBABACBACBACBCBCAB...BB'

I worry this is not the most efficient approach: each .replace() call makes a full pass over the string and allocates a new temporary string, so this is three passes and three multi-GB temporaries.

What is the most efficient solution for a string this large?

wallyk
ShanZhengYang
  • What is the performance the way you do it now? Do you have reason to believe that it is unsatisfactory in some way? – wallyk Jun 25 '17 at 03:11
  • @wallyk It's clunky. I think `.replace()` is first passing through the entire string. So, this function is actually three function calls with at least three temporary strings held in memory. It's not terribly efficient. – ShanZhengYang Jun 25 '17 at 03:31

1 Answer


A more memory-efficient method, which does not generate a chain of temporary strings along the way, is str.translate.

>>> string1 = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY"
>>> string1.translate({ord("X"): "A", ord("Y"): "B", ord("Z"): "C"})
'ABCBACCABCCABCBABACBACBACBCBCAB'

This will allocate just one (extra large in your case) string.
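As a side note, if the replacement is done repeatedly, the translation table can be precomputed once with str.maketrans. And since the data here is plain ASCII, a sketch of a likely faster variant is to work on bytes instead of str: bytes.translate uses a flat 256-entry lookup table in C. (The string1 value below is the short sample from the question, just for illustration.)

```python
# Precompute the table once; str.translate then does a single pass.
table = str.maketrans("XYZ", "ABC")

string1 = "XYZYXZZXYZZXYZYXYXZYXZYXZYZYZXY"
print(string1.translate(table))  # ABCBACCABCCABCBABACBACBACBCBCAB

# For pure-ASCII data, translating bytes is usually faster still:
# bytes.maketrans builds a 256-byte mapping table.
btable = bytes.maketrans(b"XYZ", b"ABC")
data = string1.encode("ascii")
print(data.translate(btable).decode("ascii"))  # same result
```

Either way, only one new multi-GB object is allocated for the result (plus the encoded bytes copy in the second variant, if the data isn't already stored as bytes).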

donkopotamus