2

My programming is almost all self taught, so I apologise in advance if some of my terminology is off in this question. Also, I am going to use a simple example to help illustrate my question, but please note that the example itself is not important, its just a way to hopefully make my question clearer.

Imagine that I have some poorly formatted text with a lot of extra white space that I want to clean up. So I create a function that will replace any groups of white space characters that has a new line character in it with a single new line character and any other groups of white space characters with a single space. The function might look like this

def white_space_cleaner(text):
    new_line_finder = re.compile(r"\s*\n\s*")
    white_space_finder = re.compile(r"\s\s+")
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text

That works just fine, the problem is that now every time I call the function it has to compile the regular expressions. To make it run faster I can rewrite it like this

new_line_finder = re.compile(r"\s*\n\s*")
white_space_finder = re.compile(r"\s\s+")
def white_space_cleaner(text):
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text

Now the regular expressions are only compiled once and the function runs faster. Using timeit on both functions I find that the first function takes 27.3 µs per loop and the second takes 25.5 µs per loop. A small speed up, but one that could be significant if the function is called millions of time or has hundreds of patterns instead of 2. Of course, the downside of the second function is that it pollutes the global namespace and makes the code less readable. Is there some "Pythonic" way to include an object, like a compiled regular expression, in a function without having it be recompiled every time the function is called?

Joseph Stover
  • 397
  • 4
  • 13
  • 2
    What you're doing is just fine. Sometimes a global variable is the best solution, especially when it's a static value that doesn't get edited. Problems usually arrive when the variable's value is changed from elsewhere while you don't expect it to get changed. This shouldn't be a problem with read-only varaibles like your compiled regular expressions values. – Markus Meskanen Aug 06 '15 at 18:05

5 Answers5

5

Keep a list of tuples (regular expressions and the replacement text) to apply; there doesn't seem to be a compelling need to name each one individually.

finders = [
    (re.compile(r"\s*\n\s*"), "\n"),
    (re.compile(r"\s\s+"), " ")
]
def white_space_cleaner(text):
    for finder, repl in finders:
        text = finder.sub(repl, text)
    return text

You might also incorporate functools.partial:

from functools import partial
replacers = {
    r"\s*\n\s*": "\n",
    r"\s\s+": " "
}
# Ugly boiler-plate, but the only thing you might need to modify
# is the dict above as your needs change.
replacers = [partial(re.compile(regex).sub, repl) for regex, repl in replacers.iteritems()]


def white_space_cleaner(text):
    for replacer in replacers:
        text = replacer(text)
    return text
chepner
  • 497,756
  • 71
  • 530
  • 681
  • This doesn't quite work for this example since you are subbing different values depending on whether you find a newline character or not. Building off your idea you might use a dictionary I guess, but that would definitely be less concise and readable than the example you gave. – Joseph Stover Aug 06 '15 at 18:39
  • Oh, right, I missed that. I can fix that, and provide an alternative that might be more readable (although the fix isn't *too* bad). – chepner Aug 06 '15 at 18:50
3

Another way to do it is to group the common functionality in a class:

class ReUtils(object):
    new_line_finder = re.compile(r"\s*\n\s*")
    white_space_finder = re.compile(r"\s\s+")

    @classmethod
    def white_space_cleaner(cls, text):
        text = cls.new_line_finder.sub("\n", text)
        text = cls.white_space_finder.sub(" ", text)
        return text

if __name__ == '__main__':
   print ReUtils.white_space_cleaner("the text")

It's already grouped in a module, but depending on the rest of the code a class can also be suitable.

mzc
  • 3,265
  • 1
  • 20
  • 25
  • I checked this answer because it achieves the speed up and could be easily extended to a much more complicated example, but the answer by chepner also achieved the speed up and might be clearer in situations where you are doing the exact same thing for several different patterns and the answer by by AlexVan Liew also worked and might be cleaner for situations where there are a small number of patterns or where you need the flexibility to change the patterns for certain function calls, so be sure to check those out as well. Thanks! – Joseph Stover Aug 06 '15 at 19:03
2

You could put the regular expression compilation into the function parameters, like this:

def white_space_finder(text, new_line_finder=re.compile(r"\s*\n\s*"),
                             white_space_finder=re.compile(r"\s\s+")):
    text = new_line_finder.sub("\n", text)
    text = white_space_finder.sub(" ", text)
    return text

Since default function arguments are evaluated when the function is parsed, they'll only be loaded once and they won't be in the module namespace. They also give you the flexibility to replace those from calling code if you really need to. The downside is that some people might consider it to be polluting the function signature.

I wanted to try timing this but I couldn't figure out how to use timeit properly. You should see similar results to the global version.

Markus's comment on your post is correct, though; sometimes it's fine to put variables at module-level. If you don't want them to be easily visible to other modules, though, consider prepending the names with an underscore; this marks them as module-private and if you do from module import * it won't import names starting with an underscore (you can still get them if you ask from them by name, though).

Always remember; the end-all to "what's the best way to do this in Python" is almost always "what makes the code most readable?" Python was created, first and foremost, to be easy to read, so do what you think is the most readable thing.

Alex Van Liew
  • 1,339
  • 9
  • 15
1

In this particular case I think it doesn't matter. Check:

Is it worth using Python's re.compile?

As you can see in the answer, and in the source code:

https://github.com/python/cpython/blob/master/Lib/re.py#L281

The implementation of the re module has a cache of the regular expression itself. So, the small speed up you see is probably because you avoid the lookup for the cache.

Now, as with the question, sometimes doing something like this is very relevant like, again, building a internal cache that remains namespaced to the function.

def heavy_processing(arg):
    return arg + 2

def myfunc(arg1):
    # Assign attribute to function if first call
    if not hasattr(myfunc, 'cache'):
        myfunc.cache = {}

    # Perform lookup in internal cache
    if arg1 in myfunc.cache:
        return myfunc.cache[arg1]

    # Very heavy and expensive processing with arg1
    result = heavy_processing(arg1)
    myfunc.cache[arg1] = result
    return result

And this is executed like this:

>>> myfunc.cache
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'cache'
>>> myfunc(10)
12
>>> myfunc.cache
{10: 12}
Community
  • 1
  • 1
Havok
  • 5,776
  • 1
  • 35
  • 44
-1

You can use a static function attribute to hold the compiled re. This example does something similar, keeping a translation table in one function attribute.

def static_var(varname, value):
    def decorate(func):
        setattr(func, varname, value)
        return func
    return decorate

@static_var("complements", str.maketrans('acgtACGT', 'tgcaTGCA'))
def rc(seq):
    return seq.translate(rc.complements)[::-1]
A.P.
  • 1,109
  • 8
  • 6