Python regular expression using the OR operator

Question

I am trying to parse a large sample of text files with regular expressions (RE). I am trying to extract from these files the part of the text which contains 'vu' and ends with a newline '\n'.

Patterns differ from one file to another, so I tried to look for combinations of RE in my files using the OR operator. However, I did not find a way to automate my code so that the re.findall() function looks for a combination of RE.

Here is an example of how I tried to tackle this issue, but apparently I still cannot get re.findall() to evaluate both of my regular expressions combined with the OR operator:

import re

def series2string(myserie) :
    myserie2 = ' or '.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring) : 
    x = re.findall(pattern, mystring)
    if len(x)>0:
        return 1
    else:
        return 0

#text example
text = "\n\n    (troisième chambre)\n    i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look out
pattern1 = '^\s*vu.*\n'
pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(pattern, text)

Note: I worked around this problem by looking for each pattern in a for loop, but my code would run faster if I could just call re.findall() once.

Markus Jarderot · Accepted Answer · 2015-09-21T13:57:05.067

Python regular expressions use the | operator for alternation.

def series2string(myserie):
    # Join the alternatives with |, then group them so the alternation stays self-contained.
    return '(' + '|'.join(myserie) + ')'

More information: https://docs.python.org/3/library/re.html
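As a minimal sketch of this alternation, here are the two patterns from the question joined with | and searched in one call. The helper name combine_patterns and the shortened sample text are hypothetical, and the non-capturing group (?:...) is a substitution of mine for the plain parentheses, so that re.findall() keeps returning whole matches even if a capturing group is added to a pattern later.

```python
import re

def combine_patterns(patterns):
    # Join the alternatives with | and wrap them in a non-capturing group,
    # so the combined pattern can be embedded in a larger one safely.
    return '(?:' + '|'.join(patterns) + ')'

# The two patterns from the question, written as raw strings.
pattern1 = r'^\s*vu.*\n'
pattern2 = r'^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'
combined = combine_patterns([pattern1, pattern2])

# Shortened version of the sample text (an assumption for illustration).
text = "\n\n    (troisième chambre)\n    i - vu la requête ;\n"

# One call tries every alternative at each position.
matches = re.findall(combined, text)
print(len(matches))
```

Joining with ' or ' instead produces a pattern that looks for the literal characters " or " between the two expressions, which is why the original attempt found nothing.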

The individual patterns look really messy, so I don't know what is a mistake, and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.

Always use Python raw strings for regular expressions, prefixed with r (r'pattern here'). A raw string lets you use \ in a pattern without Python trying to interpret it as a string escape; the backslash is passed directly to the regex engine.
Use \s to match white-space (spaces and line-breaks).
Since you already have several alternative patterns, don't make ( and ) optional. It can result in catastrophic backtracking, which can make matching large strings really slow.
\(? → \(
\)? → \)
{1} doesn't do anything. It just repeats the previous sub-pattern once, which is the same as not specifying anything.
\br is invalid in a normal (non-raw) string. It is interpreted as \b (the ASCII backspace character) + the letter r.
You have a quote character (') at the beginning of your text-string. Either you intend ^ to match the start of any line, or the ' is a copy/paste error.
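A few of the points above can be checked directly. This is a small sketch with a made-up sample string (not taken from the question), showing the difference between \b in a normal string and in a raw string, and that {1} changes nothing.

```python
import re

# Hypothetical sample text, chosen so "vu" appears as a word and inside a word.
sample = "vu la requête, revue"

# In a normal string literal, \b is the backspace character (0x08).
backspace = '\b'
# In a raw string the backslash survives, so the regex engine sees \b
# and treats it as a word boundary.
boundary = r'\b'

# Raw string: matches "vu" only as a whole word.
word_hits = re.findall(r'\bvu\b', sample)
# Normal string: searches for backspace + "vu" + backspace.
backspace_hits = re.findall('\bvu\b', sample)

# {1} repeats the previous sub-pattern exactly once, i.e. it is a no-op.
with_quantifier = re.findall(r'v{1}u', sample)
without_quantifier = re.findall(r'vu', sample)

print(word_hits, backspace_hits)
```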

Some errors when combining the patterns:

pattern = [pattern1, pattern2, pattern3, pattern4]
pattern = series2string(pattern)

expression(re.compile(pattern), text)

Could you update your question to include the corrected code? Keep the original code there for context, otherwise this answer won't make good sense. — John de Largentaye, Sep 21 '15 at 21:05
What is it that you want to match? How did you get your patterns? — Markus Jarderot, Sep 22 '15 at 09:01

Tanguy · Answer 2 · 2015-09-22T21:07:20.073

Thank you for your tips. My regular expressions were a little clumsy in my first post (I changed them hoping the question would be more understandable). I managed to capture the OR operator '|' thanks to 're.compile' and the code works fine!

import re

def series2string(myserie):
    # Combine the alternatives into one pattern with the | operator.
    return '|'.join(myserie)

def expression(pattern, mystring):
    # Return 1 if the pattern matches anywhere in mystring, 0 otherwise.
    x = re.findall(pattern, mystring)
    return 1 if x else 0

#text example
text = "\n\n    (troisième chambre)\n    i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look out
pattern1 = r'^\s*vu.*\n'
pattern2 = r'^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(re.compile(pattern), text)
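Since expression() only reports whether anything matched, one possible variation is to compile the combined pattern once and stop at the first match with search() rather than collecting every match. This is only a sketch, assuming a boolean result is acceptable; the names PATTERNS, COMBINED and has_match are hypothetical and the sample text is shortened.

```python
import re

# Same two patterns as above, kept in a list so more can be added.
PATTERNS = [
    r'^\s*vu.*\n',
    r'^\s*\(\w*\s*\w*\)\s*.*?vu.*\n',
]

# Compile once and reuse the compiled object for every file.
COMBINED = re.compile('|'.join(PATTERNS))

def has_match(text):
    # search() returns a match object for the first hit, or None.
    return COMBINED.search(text) is not None

sample = "\n\n    (troisième chambre)\n    i - vu la requête ;\n"
print(has_match(sample))
```

Compiling once avoids rebuilding the pattern string for each file when the same alternatives are applied to a large batch of texts.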
