1
>>> re.match(r'"([^"]|(\\\"))*"', r'"I do not know what \"A\" is"').group(0)
'"I do not know what \\"'
>>> re.match(r'"((\\\")|[^"])*"', r'"I do not know what \"A\" is"').group(0)
'"I do not know what \\"A\\" is"'

These two regexes are intended to be looking for quoted strings, with escaped quote sequences. The difference, unless I am missing something, is the order of the disjunction in the parentheses.

Why do they not both accept the entire string?

Kevin Dolan
  • 4,952
  • 3
  • 35
  • 47
  • I wonder if it's confused by the double grouping, namely, "(outer_group(inner_group))". As far as I'm aware, that doesn't quite work – Gene Burinsky Aug 22 '16 at 22:55
  • What you say is true, the order is different. And something else is different. The first one `"([^"]|(\\\"))*"` will match an escape, making it match `"asdf\"`sde" while the other one doesn't. Also if you have to handle escape quote, you have to handle escapes as well. So, neither one is valid. –  Aug 22 '16 at 23:00
  • And added note that the engine will parse this part `\\\"` into an _escape_ then a _quote_. So it only needs to be this `\\"` in it's literal string form. –  Aug 22 '16 at 23:22

3 Answers3

2

What you say is true, the order is different. And something else is different.
The first one "([^"]|(\\\"))*" will match an escape, making it
match "asdf\"sde" while the other one doesn't.

Also if you have to handle escape quote, you have to handle escapes as well. So, neither one is valid.

Here are two kind of standard ways to do this.
Both handle the escape.
You can extend this to single quotes as well.
Use the Dot-All modifier (?s) if you want to span newlines.

Method 1. - alternation

"(?:\\.|[^"\\]+)*"

 " 
 (?:
      \\ .                 # Escape anything
   |                     # or,
      [^"\\]+              # Not escape not quote
 )*
 "

Method 2. - unrolled loop

"[^"\\]*(?:\\.[^"\\]*)*"

 " 
 [^"\\]*          # Optional not escape not quote
 (?:
      \\ .             # Escape anything
      [^"\\]*          # Optional not escape not quote
 )*
 "

Both do the same. Method 2 is three to five time faster than Method 1.

1

The order in alternation groups matters.

In the first regex the [^"] alternative is tried first for every character. It matches every single character up to (and including) the first \. On the next character (") this alternative ([^"]) fails and the other one (\\\") tried. The latter also fails since "A does not match \\\". This stop the quantifier * from further matches.

In the second regex the \\\" alternative (parenthesis are redundant) tried first for every character and fails so the second alternative ([^"]) matches. But on at the first \ the first alternative matches so the lookup pointer moves past \" to A and lookup goes on.

As a general rule of thumb, place the most narrow expression in alternation first.

Dmitry Egorov
  • 9,542
  • 3
  • 22
  • 40
  • 1
    Another rule of thumb is that no two branches of an alternation should be able to match the same thing. In other words, the character class should exclude the backslash as well as the quote: `"([^"\\]|(\\"))*"`. This can have a dramatic effect on performance as well as correctness. ([ref](http://stackoverflow.com/questions/2407870/javascript-regex-hangs-using-v8)) – Alan Moore Aug 22 '16 at 23:14
  • I was unaware regexes were executed this way, as I was thinking of them as greedy non-deterministic state machines. I think this is the best explanation of specifically why these two regexes produce different results. – Kevin Dolan Aug 23 '16 at 20:16
  • @KevinDolan - An fyi, regular expressions have always been processed left to right. The state machine is constructed in order as the regex is processed. An example is given regex `a|ab|abc` and target string `abc`, the _ab_ and _abc_ will never get matched. Keep that in mind when you write regex. –  Aug 23 '16 at 20:32
0

The regex r'(A|B)' will test first try to match A, and only if that fails will it try to match B (docs)

So the regex ([^"]|(\\\") will first try to match a non-quote, and if that fails it will try to match an escaped quote mark.

So when the regex reaches \"A\" the first part part matches the \ (it is not a quote. But then neither part matches ", so the match ends there. The backslash is gobbled by the [^"] so the second half of the expression is never used.

Turned around ((\\")|[^"]), when it reaches \"A\" will first try to match the \" (it works) then it will try to match A (it matches [^"] and so the match continues.

James K
  • 3,692
  • 1
  • 28
  • 36