I wrote a Python script to read multiple GB-sized text files in a folder and strip every `"` character as quickly as possible. Is there a faster way to do this than my script? Is it possible to dedicate several CPU cores to this script while it runs?

    import re
    import os

    drc = '/root/tmp'
    pattern = re.compile('"')
    oldstr = '"'
    newstr = ''

    for dirpath, dirnames, filenames in os.walk(drc):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            with open(path) as f:  # close the handle instead of leaking it
                strg = f.read()
            if re.search(pattern, strg):
                strg = strg.replace(oldstr, newstr)
                with open(path, 'w') as f:
                    f.write(strg)
OakenDuck

1 Answer

Simplest improvement: Stop using `re`, and change `if re.search(pattern, strg):` to `if oldstr in strg:`; `re` isn't buying you anything here (it's more expensive than a simple string search for a fixed string).
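
In the context of your loop body, that's a one-line change (variable names as in your script):

    # Plain substring membership test; no regex machinery for a fixed string.
    if oldstr in strg:
        strg = strg.replace(oldstr, newstr)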

Alternatively (a much more complex option), if you know the encoding of the file, you may benefit from using the `mmap` module (specifically, its `find` method) to avoid loading the whole file into memory and decoding it when the string is moderately likely not to appear in the input; just pre-encode the search string and search the raw binary data. Note: This will not work for some encodings, where reading raw bytes without alignment can produce a false positive, but it works just fine for self-synchronizing encodings (e.g. UTF-8) or single-byte encodings (e.g. ASCII, latin-1).
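
A minimal sketch of that pre-check, assuming UTF-8 files; `file_contains` is a hypothetical helper name (the search string is pre-encoded to bytes, and `find` returns -1 when it's absent):

    import mmap
    import os

    def file_contains(path, needle=b'"'):
        """Scan a file's raw bytes for needle without decoding the whole file."""
        if os.path.getsize(path) == 0:
            return False  # mmap refuses to map a zero-length file
        with open(path, 'rb') as f:
            # Length 0 maps the entire file; ACCESS_READ keeps the mapping read-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return mm.find(needle) != -1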

Lastly, when rewriting a file, avoid slurping it all into memory and then rewriting the original in place; on top of making your program die (or crawl) if the file size exceeds physical RAM, it means that if the program dies after it begins rewriting the file, you've lost data forever. Instead, use the `tempfile` module to make a temporary file in the same directory as the original, read the original line by line, replacing as you go and writing to the temporary file, and when you're done, perform an atomic rename from the temporary file to the original file name. That replaces the original as a single operation, ensuring the file holds either the new data or the old data, never some intermediate version.
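
A minimal sketch of that pattern, assuming UTF-8 text; `replace_in_file` is a hypothetical name, and `os.replace` supplies the atomic rename:

    import os
    import tempfile

    def replace_in_file(path, oldstr, newstr, encoding='utf-8'):
        """Stream path through a temp file in the same directory, then swap it in."""
        dirpath = os.path.dirname(path) or '.'
        # Same directory as the original so the final rename stays on one filesystem.
        tmp = tempfile.NamedTemporaryFile('w', encoding=encoding,
                                          dir=dirpath, delete=False)
        try:
            with tmp, open(path, encoding=encoding) as src:
                for line in src:  # line-by-line keeps memory use flat
                    tmp.write(line.replace(oldstr, newstr))
            os.replace(tmp.name, path)  # atomic: old contents or new, never partial
        except BaseException:
            os.unlink(tmp.name)  # don't leave the temp file behind on failure
            raise

One caveat with this sketch: the temporary file is created with restrictive permissions (typically 0600), so the rewritten file may not keep the original's mode; `shutil.copymode` before the rename can fix that up if it matters.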

Parallelizing might get you something, but if you're operating against a spinning disk, the I/O contention is more likely to harm than help. The only time I've seen reliable improvements is on network file systems with plenty of bandwidth, but enough latency to warrant running I/O operations in parallel.
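
If you want to experiment anyway, a sketch using `concurrent.futures` to hand one file to each worker process; `process_file` is a hypothetical per-file worker (something like the `replace_in_file` sketch above would slot in):

    import os
    from concurrent.futures import ProcessPoolExecutor

    def process_file(path):
        # Per-file work goes here, e.g. the replace_in_file sketch above.
        replace_in_file(path, '"', '')

    def all_files(drc):
        for dirpath, dirnames, filenames in os.walk(drc):
            for fname in filenames:
                yield os.path.join(dirpath, fname)

    if __name__ == '__main__':  # guard required for the spawn start method
        with ProcessPoolExecutor() as pool:  # defaults to one worker per CPU
            # Consuming the iterator forces completion and re-raises worker errors.
            for _ in pool.map(process_file, all_files('/root/tmp')):
                pass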

ShadowRanger
  • They are simple text files (I think UTF-8), and I'm not sure how to modify my code to handle the low-RAM condition you mentioned. – Sandun Dayananda Jan 22 '20 at 06:35
  • @SandunDayananda: Look at [example](https://stackoverflow.com/a/34029605/364696) [code](https://stackoverflow.com/a/33811809/364696) for [the `mmap` module](https://docs.python.org/3/library/mmap.html). It's not particularly difficult, just a bit of boilerplate to go from an open file to a `mmap.mmap` object, then using said object in non-copying ways (e.g. don't slice it, which copies, but `memoryview`s or the `find` method work without copying). – ShadowRanger Jan 22 '20 at 07:22