
I have a number of files in which I want to replace all instances of certain strings with their corresponding replacement strings.

I currently have this code:


    mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

    # Open file for substitution
    replaceFile = open('file', 'r+')

    # read in all the lines
    lines = replaceFile.readlines()

    # seek to the start of the file and truncate
    # (this is because I want to do an "in-place" replace)
    replaceFile.seek(0)
    replaceFile.truncate()

    # Loop through each line from file
    for line in lines:
        # Loop through each Key in the mappings dict
        for i in mappings.keys():
            # if the key appears in the line
            if i in line:
                # do replacement
                line = line.replace(i, mappings[i])
        # Write the line to the file and move to next line
        replaceFile.write(line)

This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.

For instance, the "mappings" dict contains 60,728 key/value pairs. I need to process up to 50 files, each approximately 250,000 lines, and replace every occurrence of each key with its corresponding value.

There are also lines on which several different keys need to be replaced, so I can't just find the first match and then move on.

So my question is:

Is there a faster way to do the above? I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
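
For example, the best I can come up with is something roughly like the sketch below (one big alternation built from the escaped keys, with a lookup function doing the replacement), but I haven't tested it and I don't know whether it would actually be any faster:

    import re

    mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

    # build one alternation of all the escaped keys, longest first so a key
    # that is a prefix of another key can't "steal" the match
    pattern = re.compile('|'.join(
        re.escape(k) for k in sorted(mappings, key=len, reverse=True)))

    def lookup(match):
        # swap each matched key for its value from the dict
        return mappings[match.group(0)]

    with open('file') as f:
        text = pattern.sub(lookup, f.read())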

If you need more info, let me know.

Hurgh
  • The linked duplicate question has answers that cannot be topped. – 000 Sep 12 '13 at 23:11
  • The linked question has a *much* smaller dictionary of replacements. Given the questions are about performance, I'm baffled by the idea that this difference doesn't distinguish the questions. – Steve Jessop Sep 12 '13 at 23:13
  • It will go a lot faster if you don't do it a line at a time. Since you can read the whole file into memory, just do that and do one replace for each term. – kindall Sep 12 '13 at 23:15
  • @SteveJessop: I think it would be more useful to add another answer there, which repeats Tor Valamo's tests with much larger dictionaries, than to have a whole separate question and answers in parallel… – abarnert Sep 12 '13 at 23:15
  • @abarnert: the only way that's going to usefully happen is if someone answers this question and then there's a merge. Otherwise what you're proposing is that someone sticks on that question, an answer that *doesn't answer the question*, just because a similar but non-duplicate question has been asked later. But I realise there's a tension between making SO "neat and tidy" for future visitors, vs. helping the visitors we actually have now ;-p – Steve Jessop Sep 12 '13 at 23:17
  • Also, I'm willing to bet the optimization `if i in line: line = line.replace(i, mappings[i])` actually slows you down quite a bit. Think about it: when not found, you're doing a full search so you can skip a full search, to no benefit; when found, you're doing the search twice instead of once. – abarnert Sep 12 '13 at 23:18
  • @SteveJessop: I think an answer for larger dictionaries would be useful on that question. Someone searching for "mass string replace" and finding answers for mass~=40 and mass~=60000 is more likely to be happy than someone who only finds answers for mass~=40, right? – abarnert Sep 12 '13 at 23:20
  • @abarnert: true -- the person asking that question did the right thing by stating their real problem etc, but arguably it would be useful to have a later curation step in which a question and its answers are generalized outwards together. Thing is, if this question is quickly closed as a dupe then very few people will trouble to read it, realise that new answers are required on an old question, and provide them. Wikipedia this ain't. – Steve Jessop Sep 12 '13 at 23:21
  • @SteveJessop: Also, other than the size-specific timing information, most of the answers there are pretty general—and in fact some, like the suggestion to use Aho-Corasick—are probably more useful here than in that original question. – abarnert Sep 12 '13 at 23:23
  • 1. What is the distribution of the replacement key lengths? 2. Is it guaranteed that the replacements won't create something that could match one of the originals again? In other words, does the order of replacements matter? – Sven Marnach Sep 12 '13 at 23:31
  • There are solutions here using both regular expressions and string substitutions. You should benchmark the two different approaches to see which is best for your specific use-case. – bpmason1 Sep 13 '13 at 00:20
  • I have looked over the proposed duplicate answer link, and none of the solutions in there are any faster for a list of my size. – Hurgh Sep 13 '13 at 01:13
  • So far out of the solutions (and my testing on the final system), the fastest one is the one offered by **cjrh** – Hurgh Sep 13 '13 at 01:15

4 Answers


According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go in Python (building a trie data structure would be the fastest approach in C++):

import re

class Regex:

    # Regex implementation of find/replace for a massive word list.

    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, match_obj):
        # Look the matched token up in the mappings;
        # return it unchanged if there is no replacement for it.
        key = match_obj.group(0)
        return self._mappings.get(key, key)

    def replace_all(self, filename):
        with open(filename, 'r') as fp:
            text = fp.read()
        # The character class must cover the full alphabet of the keys
        # (here letters, digits and hyphens, to match e.g. 'original-1').
        text = re.sub(r"[A-Za-z0-9-]+", self.replace_func, text)
        with open(filename, 'w') as fp:
            fp.write(text)

# mapping dictionary of find -> replace pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# initialize the Regex class with the mapping dictionary
r = Regex(mappings)

# replace in file
r.replace_all('file')
vfxdrummer
  • There are multiple trie implementations on PyPI, as well as bindings for C and C++ libraries like Marisa, so a trie algorithm is almost certainly faster in Python too, unless you insist on actually implementing the trie in pure Python. – abarnert Sep 13 '13 at 01:31
  • Yes, very true. I meant implementing a trie in Python would be slower than just using regex. I was assuming that my answer had to be comprised of only Python. I take note of using C/C++ bindings. Thank you for the excellent information. – vfxdrummer Sep 13 '13 at 02:33

If this is still too slow, you'll have to find something fancier, because the loop below spends just about all of its time at C level:

for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)

Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
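
For example (an untested sketch, and `mappings`/`filenames` are assumed to exist already at module level), the same idea without the shell, using a `multiprocessing.Pool` to spread the files over your CPUs:

import multiprocessing as mp

def replace_in_file(filename):
    # exactly the same per-file logic as the loop above
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)

if __name__ == '__main__':
    pool = mp.Pool()  # defaults to one worker per CPU
    pool.map(replace_in_file, filenames)
    pool.close()
    pool.join()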

Apparently `str.replace` is faster than `re.sub`.
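
If you want to verify that on your own data, a quick `timeit` comparison along these lines should settle it (the sample text and key below are made up):

import re
import timeit

text = 'a line of filler text with original-1 buried in it ' * 1000
key, val = 'original-1', 'replace-1'
pattern = re.compile(re.escape(key))

# time plain str.replace against a precompiled re.sub on the same input
print(timeit.timeit(lambda: text.replace(key, val), number=1000))
print(timeit.timeit(lambda: pattern.sub(val, text), number=1000))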


So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so huge that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).

Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).

So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:

import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into roughly equal pieces
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    queue = mp.Queue()

    # Split the MAPPINGS KEYS up into multiple LISTS,
    # same number as CPUs
    key_batches = split_seq(list(mappings.keys()), mp.cpu_count())

    # Start the key detections; the data string is pickled and
    # sent to each worker process
    processes = []
    for keys in key_batches:
        p = mp.Process(target=detect_active_keys, args=(keys, data, queue))
        # This is non-blocking
        p.start()
        processes.append(p)

    # Consume the output from the queue: exactly one list of
    # "active" keys per process (this is blocking)
    active_keys = []
    for p in processes:
        active_keys.extend(queue.get())

    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # draining the queue
        p.join()

    # Same as the original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])

    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)

Some notes:

  • I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid Python. It should not be difficult to get it to run.
  • Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster running the same code. The docs for multiprocessing show how to work with machines on a network (see the sketch after these notes).
  • This code is still pretty simple. I would love to know whether it improves your speed at all.
  • There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
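
To flesh out the cluster point above, here is an untested sketch of the remote-manager setup described in the multiprocessing docs: one machine exposes a task queue, and workers on other machines connect to it and pull batches of keys to search. The address, port and authkey below are made-up placeholders.

from multiprocessing.managers import BaseManager
import queue  # this module is called 'Queue' on Python 2

class QueueManager(BaseManager):
    pass

# --- on the server machine ---
task_queue = queue.Queue()
QueueManager.register('get_tasks', callable=lambda: task_queue)
manager = QueueManager(address=('', 50000), authkey=b'secret')
server = manager.get_server()
# server.serve_forever()  # blocks; fill task_queue with key batches first

# --- on each worker machine ---
QueueManager.register('get_tasks')
remote = QueueManager(address=('server-host', 50000), authkey=b'secret')
# remote.connect()
# tasks = remote.get_tasks()  # proxy to the shared queue; call tasks.get()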
Caleb Hattingh
  • This doesn't address the possibility that a key might appear multiple times within the same file. – bpmason1 Sep 13 '13 at 00:09
  • @bpmason1: `string.replace` does them all. I didn't pick up from the OP that limited replacements were required? In his example, the replacements are done in the same way? – Caleb Hattingh Sep 13 '13 at 00:13
  • I stand corrected. string.replace does indeed replace all occurrences. I was still thinking about re.sub because I used a regex to answer the question. – bpmason1 Sep 13 '13 at 00:19
  • `string.replace` is the deprecated function in the `string` module. I think you meant `str.replace`, the method on `str` objects. Especially since that's what you actually used in your code. – abarnert Sep 13 '13 at 01:29
  • @abarnert: yep, fixed. – Caleb Hattingh Sep 13 '13 at 02:46

The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
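
For the record, here's the kind of thing I mean by the cheap-replace part: once the search step has produced the match positions, you can rebuild the string once over the non-overlapping spans instead of calling replace repeatedly (building it forward in one pass works just as well as replacing from the end; either way you never rescan the string). A hedged sketch; `spans` is whatever your search step returns:

def splice(text, spans, mappings):
    # spans: sorted, non-overlapping (start, end) pairs found by the search
    pieces, prev = [], 0
    for start, end in spans:
        pieces.append(text[prev:start])           # untouched stretch
        pieces.append(mappings[text[start:end]])  # the replacement
        prev = end
    pieces.append(text[prev:])                    # tail after the last match
    return ''.join(pieces)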

Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched for all M substrings at each position, instead of making M passes over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated, for probably only a small benefit.

So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)

The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so that, e.g., "original-1" and "original-2" are both branches off the same subtree "original-", and they don't need to be handled separately until the very last character.

The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
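
To make the structure concrete, here's a minimal pure-Python sketch of a plain trie (for illustration only; as noted below, at this scale you'd want a C-backed implementation rather than pure Python):

def build_trie(keys):
    # nested dicts; the '$' entry marks the end of a complete key
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node['$'] = key
    return root

def longest_match(trie, text, pos):
    # walk the trie along text[pos:]; return the longest key ending here
    node, found = trie, None
    for ch in text[pos:]:
        if ch not in node:
            break
        node = node[ch]
        found = node.get('$', found)
    return found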

Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)

There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.

Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.

Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, that you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:

import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

# replace from the end so that earlier match indices stay valid
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
abarnert

This uses a `with` block to prevent leaking file descriptors, and `str.replace` ensures that all instances of each key get replaced within the text.

mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
with open('file', 'r+') as fd:

    # read in all the data
    text = fd.read()

    # seek to the start of the file and truncate so file will be edited inline
    fd.seek(0)
    fd.truncate()

    for key in mappings.keys():
        text = text.replace(key, mappings[key])

    fd.write(text)
bpmason1
  • Setting `count` to `sys.maxint` is silly—the default, or 0, means to replace all occurrences. Also, there's no way a regex can beat a plain string search for plain string patterns, so unless you suspect there's something wrong with `str.replace`, figuring out how to mimic it with `re.sub` is pointless in the first place. Finally, even if the file text can't be held in memory, as long as it can be mapped into VM (via `mmap`) you can search it all at once with the same code. (That being said, it might be faster to search segments that do fit into memory.) – abarnert Sep 13 '13 at 18:10