
I've made the following code, which deciphers some byte arrays into "readable" text for a translation project.

from pathlib import Path
from string import printable

with open(Path(cur_file), mode="rb") as file:
    contents = file.read()

text = ""
for i in range(0, len(contents), 2): # Since it's encoded in UTF16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        text+="[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        #print("Normal byte")
        if chr(byte) in printable:
            text+=chr(byte)
        elif byte == 0x00:
            pass
        else:
            text+="[" + "0x{:02x}".format(byte) + "]"
    else:
        #print("Special byte")
        text+="[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"
# Some dirty replaces - Probably slow but what do I know - It works
text = text.replace("[0x0e]n[0x01]","[USERNAME_1]") # Your name
text = text.replace("[0x0e]n[0x03]","[USERNAME_3]") # Your name
text = text.replace("[0x0e]n[0x08]","[TOWNNAME_8]") # Town name
text = text.replace("[0x0e]n[0x09]","[TOWNNAME_9]") # Town name
text = text.replace("[0x0e]n[0x0a]","[CHARNAME_A]") # Character name

text = text.replace("[0x0a]","[ENTER]") # Generic enter

lang_dict[emsbt_key_name] = text

While this code does work and produces output like:

Cancel[0x00 0x00]

And more complex ones, I've stumbled upon a performance problem when I loop it over 60,000 files.

I've read a couple of questions about += with large strings, and people say that join is preferred for building large strings. However, even with strings of just under 1,000 characters, a single file takes about 5 seconds to store, which is a lot.
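
For reference, my understanding of that suggestion is roughly the sketch below (illustrative only; it mirrors the loop above, but collects the pieces in a list and joins once at the end):

parts = []
for i in range(0, len(contents), 2):
    byte, byte_2 = contents[i], contents[i + 1]
    if byte == 0x00 and byte_2 == 0x00:
        parts.append("[0x00 0x00]")
    elif byte_2 == 0x00:
        # printable ASCII character, otherwise keep the raw byte marker
        parts.append(chr(byte) if chr(byte) in printable else "[0x{:02x}]".format(byte))
    else:
        parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
text = "".join(parts)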

It almost feels like it starts fast and gets progressively slower.

What would be a way to optimize this code? I also feel it's abysmal as written.

Any help or clue is greatly appreciated.

EDIT: Added cProfile output:

         261207623 function calls (261180607 primitive calls) in 95.364 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
        1    0.000    0.000   95.365   95.365 start.py:1(<module>)
        1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
    11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
 62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
 63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
   160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
   160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
    68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
   160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
    91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
    68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
    68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
    68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
       37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
   937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
    11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
   137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)

EDIT: Upon further inspection with line_profiler, it turns out that the culprit isn't even in the above code. It's well outside it, in code where I search over the indexes to see if there is a next file (looking one ahead of the current index). That lookup apparently consumes a whole lot of CPU time.

Fusseldieb
  • You can use `re.sub()` to perform multiple replacements all at once. – Barmar Jan 04 '23 at 19:59
  • Instead of concatenating strings, append to a list. Then at the end, use `''.join(list_of_strings)` to concatenate them all at once. – Barmar Jan 04 '23 at 20:00
  • Have you profiled to find out what section of code is "slow"? – SethMMorton Jan 04 '23 at 20:00
  • If the file is encoded in UTF-16, then you should open it in text mode, with the encoding specified as UTF-16. All of your slow byte-by-byte processing simply goes away. – jasonharper Jan 04 '23 at 20:01
  • Thanks for all the comments so far. @SethMMorton I'm running a profiler as we speak, but it's taking so long to complete, that I posted the question in the meantime. – Fusseldieb Jan 04 '23 at 20:01
  • I have a hard time believing this code takes 10 seconds for a 150-byte file. – Barmar Jan 04 '23 at 20:02
  • @jasonharper It's not pure UTF16, there are some bytes in between that aren't readable text and I store them as literal [byte1 byte2]. At the end, I save it all into several JSON files (they are small as in 5kb). If I don't do that, the JSON would end up with a lot of strange characters. – Fusseldieb Jan 04 '23 at 20:02
  • As for your string creation (I don't know what Python version you are using), f-strings are preferred for string formatting and might help with performance instead of `+`-ing strings together. – Andrew Ryan Jan 04 '23 at 20:05
  • @Barmar Interesting approach. I may give it a try and will report back! – Fusseldieb Jan 04 '23 at 20:07
  • @AndrewRyan I'm using 3.11. I already use f-strings a lot in my code, but don't know how it would help me in this particular case... – Fusseldieb Jan 04 '23 at 20:12
  • @Fusseldieb For when you are doing `"[" + "0x{:02x}".format(byte) + "]"` and `"[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"`, as these are both string concatenations (I am assuming that most of the bytes you are changing are formatted here, in which case this would improve speed somewhat), though you should look into @Barmar's list method. – Andrew Ryan Jan 04 '23 at 20:18
  • Is there any way to get some sample data and the corresponding expected result? – JonSG Jan 04 '23 at 20:54
  • @JonSG There isn't at the moment, but I figured out what was bogging down the speed. I will answer my own question AND upvote all the people who suggested changes to the current code to make it even faster. All in all, I'm extremely grateful to everyone who commented! Thank you. – Fusseldieb Jan 04 '23 at 21:48
  • An issue where the underlying problem turned out to be outside the scope of the given [mre] seems a shoo-in for "not reproducible" as a close reason. – Charles Duffy Jan 15 '23 at 20:32
  • @CharlesDuffy I just discovered this after the fact. This question could indeed be closed, if needed. – Fusseldieb Jan 17 '23 at 05:09

3 Answers


Just in case it gives you some pathways to explore: if I were in your situation, I'd time two separate checks over, say, 100 files:

  • How much time it takes to execute only the for loop.
  • How much it takes to do only the six replaces.

If either one takes most of the total time, I'd try to find a solution just for that part. For raw replacements there is specific software designed for massive replacements. I hope this helps in some way.
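
A minimal sketch of that kind of split timing (assuming hypothetical decode_file() and apply_replacements() helpers that contain only the byte-pair loop and only the six replaces, respectively) could look like this:

import time

def time_over_files(label, func, files):
    # Time one isolated step over a sample of files.
    start = time.perf_counter()
    for f in files:
        func(f)
    print("{}: {:.3f} s".format(label, time.perf_counter() - start))

# Hypothetical usage, e.g. over the first 100 files:
#   time_over_files("decode loop only", decode_file, sample_files)
#   time_over_files("replaces only", apply_replacements, sample_files)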

Fernando

You might use .format to replace += and + in the following way. Let's say you have code like this:

text = ""
for i in range(10):
    text += "[" + "{}".format(i) + "]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]

which is equivalent to

text = ""
for i in range(10):
    text = "{}[{}]".format(text, i)
print(text)  # [0][1][2][3][4][5][6][7][8][9]

Note that other string formatting approaches could be used here as well; I elected to use .format since you are already using it.
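
For completeness, since f-strings came up in the comments, the same loop written with an f-string would be:

text = ""
for i in range(10):
    text = f"{text}[{i}]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]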

Daweo

Turns out, just before this code I was looking up an entry in my list (on each iteration) with the .index() method and then peeking one ahead (to see if there was a path change), which really did bog down the performance.

In the cProfile output we can see it clearly:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}

It wasn't .replace()! The offending code wasn't even included in my question.

What really made me understand what this call was (other than that it called index somehow) was another profiler:

I believe that's what Robert Kern's line_profiler is intended for.

Source: https://stackoverflow.com/a/3927671/3525780

It showed me neatly, line by line, how much CPU time each piece of code consumed, much more clearly than cProfile.
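
For anyone finding this later, the typical workflow (the file and function names below are the ones from my cProfile output above) is to decorate the suspect function and run the script through kernprof:

# kernprof injects `profile` into builtins when run with -l, so no import is needed
@profile
def to_json():
    ...  # the function in emsbt_to_json.py that does all the work

# Run with:
#   kernprof -l -v emsbt_to_json.py
# -l enables line-by-line profiling, -v prints the per-line report afterwards.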

Once I found out, I replaced it with:

for ind, cur_file in enumerate(to_write):
    next_file = None
    if ind < len(to_write) - 1:
        next_file = to_write[ind + 1]
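
To illustrate the difference with a toy example (names here are illustrative, not the real project code):

to_write = ["file_{}".format(i) for i in range(20000)]

# Roughly what I was doing before (slow): .index() rescans the list from the
# start on every iteration, so the whole pass is O(n^2).
# for cur_file in to_write:
#     ind = to_write.index(cur_file)
#     next_file = to_write[ind + 1] if ind < len(to_write) - 1 else None

# With enumerate the index comes for free, so the pass stays O(n).
for ind, cur_file in enumerate(to_write):
    next_file = to_write[ind + 1] if ind < len(to_write) - 1 else None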

This answer probably doesn't make much sense without the actual code, but I will leave it here nonetheless.

Fusseldieb