1

I am parsing sql WHERE clauses and I have the following javascript (regex)

(?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6)))

That I am matching against

id BETWEEN 3 and 10

In order for this regex to work, I have to add \s or \s+ at the end of the regex and include a space at the end of the string being matched.

Can someone explain why this matching of the extra space is necessary to match the 10 part of the string (in capturing group 7)?

Note that this regex is extracted from a larger regex which is used to parse an sql filter:

(\(*)([\w][\w\d.]*)\s*([<>!=]{1,2}|like|not like|is null|is not null|in\s*\()?\s*(?!and|or)(?:(?:(between )(['"]?)(.*?)(\5)( and )(['"]?)(.*?)(\9))|(?:(['"]?)(.*?)(\12)))\s*(\)*)\s+(?!'|")\s*(and|or)?\s*
mseifert
  • 5,390
  • 9
  • 38
  • 100
  • It's actually clear that the 6th group matches an empty string. So, `.*?` is at the end of the pattern, and being a lazy pattern, matches the least amount of characters it can match, that is, zero here. Without further clarification on what you need to match, I cannot suggest anything, but definitely `.*?` must be replaced with something more greedy or you need to add other obligatory patterns after it. What is the real pattern? – Wiktor Stribiżew Oct 15 '16 at 20:46
  • @WiktorStribiżew - I've included the larger pattern and link which is used to parse a mysql filter. – mseifert Oct 15 '16 at 20:54
  • There is a certain issue with this approach, as you cannot use a backreference to the empty string as a boundary. You will need to use alternation and capture `"` and `'` delimited parts into 1 capturing group, and a sequence of non-whitespace into another. – Wiktor Stribiżew Oct 15 '16 at 21:12
  • @WiktorStribiżew - Forgive the probably obvious question. Why is the least amount of characters zero here? And why does adding a space at the end allow it to then match the string `10`? – mseifert Oct 15 '16 at 21:12
  • I will add more to the answer. – Wiktor Stribiżew Oct 15 '16 at 21:17

1 Answers1

1

In (?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6))), the 6th group - (['"]?) - matches an empty string. So, .*? (the 7th group) appears at the end of the pattern, and being a lazy pattern, matches the least amount of characters it can match, that is, zero.

Consider a regex like /I have a .*?/ and you try it against a I have a cat string (see demo here). The regex finds I have a and then the .*? part - matching any zero or more chars other than linebreak chars as few as possible - matches the empty space right before cat because that is how lazy quantifiers work: rather than match eagerly, they let subsequent patterns match, and only when they fail, the lazy pattern will "expand", i.e. will try to match. That is why the lazy patterns at the end of the pattern match the minimal amount of chars they need to match: .+? will match only 1 char, and .*? will match 0.

See Greedy vs. Reluctant vs. Possessive Quantifiers for more information on how lazy quantifiers work.

As you cannot use a backreference to the empty string as a boundary, you will need to use alternation and capture " and ' delimited substrings into 1 capturing group, and a sequence of non-whitespace into another.

Besdies, the \s+ close to the end of the pattern needs to be changed into \s* to allow the string not to end with whitespace.

(\(*)(\w[\w.]*)\s*([<>!=]{1,2}|like|not like|is null|is not null|in\s*\()?\s*(?!and|or)(?:(?:(between )(?:(['"])(.*?)(\5)|(\S+))( and )(?:(['"])(.*?)(\10)|(\S+)))|(?:(['"])(.*?)(\14)|(\S+)))\s*(\)*)\s*(?!'|")\s*(and|or)?\s*

See this regex demo

Community
  • 1
  • 1
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Excellent. I appreciate your explaining `only when they fail, the lazy pattern will "expand"` as well as the subtlety around boundaries. I'm still relatively new at regex and these finer points make all the difference. – mseifert Oct 15 '16 at 21:55