I need to search for patterns which may have many metacharacters. Currently I use a long regex.

prodObjMatcher=re.compile(r"""^(?P<nodeName>[\w\/\:\[\]\<\>\@\$]+)""", re.S|re.M|re.I|re.X)

(my actual pattern is very long so I just pasted some relevant portion on which I need help)

This is especially painful when I need to write combinations of such patterns in a single re compilation.

Is there a pythonic way for shortening the pattern length?

Yogesh Luthra
  • Why use `.*?` at the end if it matches an empty string? Also, you do not have to *ever* escape any chars in the character class except for shorthand classes, `^`, `-`, `]`, and ``\``. There are ways to keep even those (except for ``\``) unescaped in the character class. – Wiktor Stribiżew Aug 11 '16 at 12:58
  • In addition to the comments, this smells like a job for `xml` **parsing** (node name???). – Jan Aug 11 '16 at 13:00
  • @WiktorStribiżew ***my actual pattern is very long so I just pasted some relevant portion on which I need help***. It would be great to get an answer to what was asked. I am not yet an expert on regex in Python, so I generally escape metacharacters. I will probably learn over time which ones to escape and which not. – Yogesh Luthra Aug 11 '16 at 13:20

1 Answer

Look, your pattern can be reduced to

r"""^(?P<nodeName>[]\w/:[<>@$]+).*?"""

Note that you do not have to ever escape any non-word character in the character classes, except for shorthand classes, ^, -, ], and \. There are ways to keep even those (except for \) unescaped in the character class:

  • ] at the start of the character class
  • - at the start/end of the character class
  • ^ anywhere except the start of the character class (it needs escaping only when it is the first character and is meant as a literal symbol).
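
A quick way to convince yourself that the escaped and the reduced character classes behave the same is to run both on a few inputs. This is only a sketch: the sample strings and the helper `node_name` are made up for illustration, and the trailing `.*?` is left out because it does not affect the captured group.

```python
import re

# Character class as written in the question (everything escaped)
# versus the reduced form from this answer.
escaped = re.compile(r"^(?P<nodeName>[\w\/\:\[\]\<\>\@\$]+)")
reduced = re.compile(r"^(?P<nodeName>[]\w/:[<>@$]+)")

def node_name(pattern, text):
    """Return the nodeName group, or None when there is no match."""
    m = pattern.match(text)
    return m.group("nodeName") if m else None

# Made-up sample inputs, for illustration only.
samples = ["foo/bar:baz[0]<x>@y$z rest", "a-b", "[x] y", "  leading"]
pairs = [(node_name(escaped, s), node_name(reduced, s)) for s in samples]
same = all(a == b for a, b in pairs)
print(same)

# '^' away from the start and '-' at the end are literal without a backslash.
extra = re.compile(r"[\w^-]+")
extra_match = extra.match("a^b-c d").group()
print(extra_match)
```

Both patterns should capture the same text for every sample, and the last pattern should consume `^` and `-` as ordinary characters.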

Outside a character class, you must escape \, [, (, ), +, $, ^, *, ?, ., and |.
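
If a piece of the pattern is meant to be matched literally, one option (not something the question uses, just a suggestion) is to let `re.escape` add the backslashes rather than typing them by hand. A minimal sketch, with made-up literal text:

```python
import re

# Hypothetical literal text that is full of metacharacters.
literal = "a+b (c) [d]? $5.00"

# re.escape backslash-escapes whatever is special outside a character class.
pattern = re.compile(re.escape(literal))
found = pattern.search("price: a+b (c) [d]? $5.00 total")
print(found.group() == literal)

# Without escaping, '+' is a quantifier, so "a+b" also matches "aab".
unescaped_hits = re.fullmatch("a+b", "aab") is not None
escaped_hits = re.fullmatch(re.escape("a+b"), "aab") is not None
print(unescaped_hits, escaped_hits)
```

The escaped fragment can then be concatenated with hand-written regex parts when you build a longer pattern.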

Note that / is not a special regex metacharacter in Python regex patterns, and does not have to be escaped.

Use raw string literals when defining your regex patterns to avoid issues (like confusing word boundary r'\b' and a backspace '\b').
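
The difference is easy to see in a small experiment. The sample text here is made up for illustration:

```python
import re

text = "a cat sat"

# Raw string: the regex engine receives backslash + 'b', a word boundary.
raw_hits = re.findall(r"\bcat\b", text)

# Regular string: Python turns '\b' into a backspace character (0x08)
# before the regex engine ever sees it, so nothing in the text matches.
plain_hits = re.findall("\bcat\b", text)

print(raw_hits, plain_hits)
print(len(r"\b"), len("\b"))  # two characters versus one
```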

Wiktor Stribiżew