Bear with me: I can't include my 1,000+ line program, and there are a couple of questions in the description.

So I have a couple types of patterns I am searching for:

# Literally just a regular word
re.search("Word", arg)

# Varying complex pattern
re.search("[0-9]{2,6}-[0-9]{2}-[0-9]{1}", arg)

# Words with varying cases and the possibility of trailing special characters
re.search("Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?", arg)

# I also use re.findall with the patterns above
re.findall("uses patterns above", arg)

I have about 75 of these in total, and some need to be moved into deeply nested functions.

When and where should I compile the patterns?

Right now I am trying to improve my program by compiling everything in main and then passing the correct list of compiled pattern objects to each function that uses them. Would this increase my performance?

Would doing something like the following increase the speed of my program?

re.compile("pattern").search(arg)

Do compiled patterns stay in memory, so that if a function containing one is called multiple times it skips the compiling step? That way I wouldn't have to pass the patterns from function to function.

Is it even worth compiling all of the patterns if I have to pass them around so much?

Is there a better way to match regular words without regex?

Short example of my code:

import re

def foo(arg, allWords):
   # Does some things with arg, then puts the result into a variable;
   # this function does not use allWords.

   data = arg  # This is the manipulated version of arg

   return bar(data, allWords)


def bar(data, allWords):
   if allWords[0].search(data) is not None:
      temp = data.split("word1", 1)[1]
      return temp

   elif allWords[1].search(data) is not None:
      temp = data.split("word2", 1)[1]
      return temp


def main():

   allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

   arg = "This is a very long string from a text document input, the provided patterns might not be word1 in this string but I need to check for them, and if they are there do some cool things word3"

   #This loop runs a couple million times 
   # because it loops through a couple million text documents
   while True:
      data = foo(arg, allWords)
SPYBUG96

2 Answers


This is a tricky subject: many answers, even some legitimate sources such as David Beazley's Python Cookbook, will tell you something like:

[Use compile()] when you’re going to perform a lot of matches using the same pattern. This lets you compile the regex only once versus at each match. [see p. 45 of that book]

However, that really hasn't been true since sometime around Python 2.5. Here's a note straight out of the re docs:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

There are two small arguments against this, but (anecdotally speaking) these won't result in noticeable timing differences the majority of the time:

  • The size of the cache is limited.
  • Using compiled expressions directly avoids the cache lookup overhead.
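For reference, the cache bound is visible as a CPython implementation detail (the underscore-prefixed name below is private and may change between versions):

import re

# CPython implementation detail, not a public API: the module-level
# functions cache compiled patterns in a dict capped at re._MAXCACHE entries.
print(re._MAXCACHE)  # 512 on recent CPython versions

# When more distinct patterns than that are in play, the cache is purged
# and subsequent calls recompile.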

Here's a rudimentary test of the above using the 20 newsgroups text dataset. On a relative basis, the improvement in speed is only about 1.6% with compiling, presumably due mostly to avoiding the cache lookup.

import re
from sklearn.datasets import fetch_20newsgroups

# A list of length ~20,000, paragraphs of text
news = fetch_20newsgroups(subset='all', random_state=444).data

# The tokenizer used by most text-processing vectorizers such as TF-IDF
regex = r'(?u)\b\w\w+\b'
regex_comp = re.compile(regex)


def no_compile():
    for text in news:
        re.findall(regex, text)


def with_compile():
    for text in news:
        regex_comp.findall(text)

%timeit -r 3 -n 5 no_compile()
1.78 s ± 16.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)

%timeit -r 3 -n 5 with_compile()
1.75 s ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)

That really only leaves one very defensible reason to use re.compile():

By precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action. [source; p. 15]

It's not uncommon to see constants declared at the top of a module with compile. For example, in smtplib you'll find OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I).
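Applied to the patterns from the question, that style might look like the following sketch (the constant names here are illustrative, not taken from any real module):

import re

# Compiled once at import time, reused by every call below.
SUPPLIER_FIELD = re.compile(r"Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?")
ID_PATTERN = re.compile(r"[0-9]{2,6}-[0-9]{2}-[0-9]{1}")

def find_supplier_field(line):
    # No recompilation here, no matter how often this is called.
    return SUPPLIER_FIELD.search(line)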

Note that compiling happens (eventually) whether or not you use re.compile(). When you do use compile(), you're compiling the passed regex at that moment. If you use the module-level functions like re.search(), you're compiling and searching in this one call. The two processes below are equivalent in this regard:

# With re.compile() - gets you a compiled pattern object,
#     and then call its method, `.search()`.
a = re.compile('regex(es|p)')  # compiling happens now
a.search('regexp')             # searching happens now

# With the module-level function:
re.search('regex(es|p)', 'regexp')  # compiling and searching both happen here

Lastly you asked,

Is there a better way to match regular words without regex?

Yes; this is mentioned as a "common problem" in the HOWTO:

Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine. [emphasis added]

...

In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
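For the question's first case ("literally just a regular word"), that advice amounts to plain substring operations:

arg = "some text containing a Word somewhere"

# Instead of re.search("Word", arg):
if "Word" in arg:
    after = arg.split("Word", 1)[1]  # text after the first occurrence

pos = arg.find("Word")  # index of the first occurrence, or -1 if absent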

Brad Solomon

Let's say that word1, word2, ... are regexes. Let's rewrite those parts:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))

To support regexes with | in them, you would have to wrap each expression in a non-capturing group:

allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))

(that also works with standard words of course, and it's still worth using regexes because of the | part)
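The non-capturing (?:...) groups matter here: with plain capturing parentheses, re.split() inserts the group matches into its result, which would break the indexing used below. A quick demonstration:

import re

# Capturing groups: split() interleaves the group matches (None for
# groups that didn't participate) into the result list.
print(re.compile("(word1)|(word2)").split("a word1 b", maxsplit=1))
# ['a ', 'word1', None, ' b']

# Non-capturing groups keep the classic split behaviour:
print(re.compile("(?:word1)|(?:word2)").split("a word1 b", maxsplit=1))
# ['a ', ' b']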

Now, this is a disguised loop with each term hardcoded:

def bar(data, allWords):
   if allWords[0].search(data):
      temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
      return temp

   elif allWords[1].search(data):
      temp = data.split("word2", 1)[1]
      return temp

can be rewritten simply as

def bar(data, allWords):
   return allWords.split(data, maxsplit=1)[1]
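One caveat (my addition, not part of the original suggestion): if none of the words match, split() returns the whole string in a one-element list, so indexing [1] raises IndexError. A slightly defensive variant that mirrors the implicit None return of the original bar():

import re

allWords = re.compile("|".join(["word1", "word2", "word3"]))

def bar(data, allWords):
    parts = allWords.split(data, maxsplit=1)
    # parts == [data] when nothing matched; return None in that case,
    # like the original if/elif chain did.
    return parts[1] if len(parts) > 1 else None

print(bar("before word2 after", allWords))  # ' after'
print(bar("no match here", allWords))       # None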

In terms of performance:

  • the regular expression is compiled once at start, so it's as fast as it can be
  • there's no loop or pasted expressions; the "or" part is done by the regex engine, which is compiled code most of the time: you can't beat that in pure Python
  • the match and the split are done in one operation

The last hiccup is that internally the regex engine tries the alternatives in a loop, which makes this an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning a given text cannot be matched by several of them; otherwise the longer one would have to come before the shorter one, as sketched below).
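For the plain-word case, here is a sketch of that ordering idea, sorting the alternatives longest-first so a short word never shadows a longer one that starts the same way (re.escape is safe here because these are literal words, not regexes):

import re

words = ["cat", "cats", "category"]
# Longest first, so "cat" cannot shadow "cats" or "category".
pattern = re.compile("|".join(sorted(map(re.escape, words), key=len, reverse=True)))
print(pattern.findall("my cats like a category"))  # ['cats', 'category']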

Jean-François Fabre
  • I understand what you suggest, but I'm not sure that building an alternation with many items to search a long string is a good idea: 1) the items have to be sorted by length in reverse order first; 2) re is unable to optimize this kind of pattern design before the "engine walk"; 3) during the "engine walk", all the alternatives are tested at each position of the string that doesn't match. – Casimir et Hippolyte Nov 13 '17 at 16:32
  • I see what you mean: the search is still linear, so with a lot of regexes it can be slow. The alternative would be tailor-made parsing, which would mean knowing the patterns in advance and optimizing the parsing without regexes (at least at the start). I'd say that sorting by length in reverse is only needed if there's a risk that two expressions match the same text, which doesn't seem to be the case for the OP (check the regexes in the question). I'll edit to add a synthesis of the issues you raised. – Jean-François Fabre Nov 13 '17 at 19:38
  • It's indeed hard to suggest a way without knowing the patterns (or what they are supposed to find) and the strings (size and format, if any). When I see this kind of question (with a list of patterns), I become suspicious about the upstream approach, or I think that people have missed some details in the string format. But in the worst case, when scripting languages start to reach their limits, perhaps a DBMS and eventually tools like Elasticsearch are the way to go. – Casimir et Hippolyte Nov 13 '17 at 21:12