Simplest improvement: Stop using `re`, and change `if re.search(pattern, strg):` to `if oldstr in strg:`; `re` isn't buying you anything here (it's more expensive than simple string search for a fixed string).
Alternatively (at the cost of considerably more complexity), if you know the encoding of the file, you may benefit from using the `mmap` module (specifically, its `find` method) to avoid having to load the whole file into memory and decode it when the string is moderately likely not to appear in the input; just pre-encode the search string and search the raw binary data. Note: This will not work for some encodings, where reading raw bytes without alignment might produce a false positive, but it works just fine for self-synchronizing encodings (e.g. UTF-8) or single-byte encodings (e.g. ASCII, latin-1).
Lastly, when rewriting the file, avoid slurping it into memory and then rewriting the original file in place; on top of making your program die (or run slowly) if the file size exceeds physical RAM, it means that if the program dies after it begins rewriting the file, you've lost data forever. The `tempfile` module can be used to make a temporary file in the same directory as the original file; read line by line, replacing as you go and writing to the temporary file until you're done. Then just perform an atomic rename from the temporary file to the original file name, replacing the original file as a single operation (ensuring it's either the new data or the old data, never some intermediate version of the data).
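A sketch of that pattern (the helper name `replace_in_file` is illustrative; `os.replace` provides the atomic rename on both POSIX and Windows, and placing the temporary file in the same directory keeps the rename on one filesystem):

```python
import os
import tempfile

def replace_in_file(path, oldstr, newstr, encoding="utf-8"):
    """Rewrite `path` line by line, replacing oldstr with newstr,
    without loading the whole file into memory. Output goes to a
    temporary file in the same directory, which is then atomically
    renamed over the original, so a crash mid-rewrite never leaves
    a half-written file behind."""
    dirname = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding=encoding) as src, \
         tempfile.NamedTemporaryFile("w", encoding=encoding,
                                     dir=dirname, delete=False) as tmp:
        for line in src:
            tmp.write(line.replace(oldstr, newstr))
    os.replace(tmp.name, path)  # atomic swap: old data or new, never partial
```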
Parallelizing might get you something, but if you're operating against a spinning disk, the I/O contention is more likely to harm than help. The only time I've seen reliable improvements is on network file systems with plenty of bandwidth, but enough latency to warrant running I/O operations in parallel.