I have a regex that I use to match expressions of the form (val1 operator val2).

The regex looks like this:

(\(\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*\))

This works well and matches what I want, as you can see in this demo.

BUT :D (here comes the butter)

I want to optimise the regex itself by making it more readable and more compact. I searched for how to do that and found something called a back-reference, which lets you name your capturing groups and then reference them later, like this:

(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*(\g{Val})\s*\))

Here I named the group that captures the left side of the expression Val and later referenced it as (\g{Val}). The problem is that this expression, as you can see here, only matches the case where the left side of the expression is exactly the same as the right side, e.g. (a==a) or (1==1), and does not match expressions such as (a==b)!

Now the question is: is there a way to reference the pattern instead of the matched value?!

Wiktor Stribiżew
Ahmad Hajjar

2 Answers

Note that \g{N} is equivalent to \N (for example, \g{1} is the same as \1), that is, a backreference that matches the same value, not the pattern, that the corresponding capturing group matched. This syntax is a bit more flexible though, since you can refer to capture groups relative to the current position by putting - before the number (e.g. \g{-2}: (\p{L})(\d)\g{-2} will match a1a).
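To make the "same value, not the same pattern" behaviour concrete, here is a minimal sketch using Python's standard `re` module. Note the assumptions: Python spells a named backreference `(?P=Val)` rather than PCRE's `\g{Val}`, and the value pattern and operator list are shortened here for readability.

```python
import re

# Shortened value pattern (assumption: identifiers and integers only).
VAL = r"[a-zA-Z]+[0-9]*|[0-9]+"

# (?P=Val) is Python's named backreference: it must match the exact
# text that the group Val captured, not the group's pattern.
BACKREF = re.compile(
    r"\(\s*(?P<Val>" + VAL + r")\s*(==|!=)\s*(?P=Val)\s*\)"
)

def matches_with_backref(text: str) -> bool:
    """Return True if the whole string matches the backreference regex."""
    return BACKREF.fullmatch(text) is not None
```

With this pattern, `matches_with_backref("(a==a)")` succeeds because both sides are the same text, while `matches_with_backref("(a==b)")` fails even though `b` would satisfy the pattern of `Val`.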

The PCRE engine allows subroutine calls to recurse subpatterns. To repeat the pattern of Group 1, use (?1), and (?&Val) to recurse the pattern of the named group Val.

Also, you may use character classes to match single characters, and consider using the ? quantifier to make parts of the regex optional:

(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|[*\/+-]|[=!><]=|[><])\s*((?&Val))\s*\))

See the regex demo

Note that \'.*\' and \[.*\] can match too much; consider replacing them with \'[^\']*\' and \[[^][]*\].
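A small sketch of the over-matching problem, again with Python's `re` module. The sample strings are made up for illustration, and the bracket class is written as `[^\[\]]`, which is an equivalent, more explicit spelling of `[^][]`.

```python
import re

def find_all(pattern: str, text: str) -> list:
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

strings = "'a' == 'b'"   # hypothetical input with two quoted strings
arrays = "[1] in [2]"    # hypothetical input with two bracketed arrays

# Greedy .* runs to the LAST closing delimiter on the line.
greedy_strings = find_all(r"'.*'", strings)
greedy_arrays = find_all(r"\[.*\]", arrays)

# A negated character class stops at the FIRST closing delimiter.
tight_strings = find_all(r"'[^']*'", strings)
tight_arrays = find_all(r"\[[^\[\]]*\]", arrays)
```

Here the greedy patterns each return a single match spanning the whole input, while the negated-class patterns return the two values separately.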

Wiktor Stribiżew
  • Note that the `(?&Val)` must be wrapped into a separate capturing group to also capture a value, hence `((?&Val))`. – Wiktor Stribiżew Aug 17 '16 at 09:21
  • and thanks for the advice I will change the array and string matchers :) +1 – Ahmad Hajjar Aug 17 '16 at 09:29
  • btw what if I want to match recursive expressions like `((a+1)==(b*5))` .. in this case my expression will match only two expressions `(a+1)` and `(b*5)` how can I adjust it to match the whole expression !? – Ahmad Hajjar Aug 17 '16 at 10:02
  • If you apply recursion, you will lose captures inside it. Do you really want it? Use a basic expression to match all nested parentheses with `/\((?>[^()]++|(?R))*\)/` and then parse each with your expression. Or get rid of all this fancy stuff then, use [`/[a-zA-Z]+[0-9]*|[0-9]+|\'[^\']*\'|\[[^][]*\]/`](https://regex101.com/r/wR6oX9/1). – Wiktor Stribiżew Aug 17 '16 at 10:05
  • yes please, I want to know this also out of curiosity :) – Ahmad Hajjar Aug 17 '16 at 10:07
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121145/discussion-between-ahmad-hajjar-and-wiktor-stribizew). – Ahmad Hajjar Aug 17 '16 at 10:18

What language/application are you using this regular expression in? If you have the option, you can specify the different parts as named variables and then build the final regular expression by combining them:

val = "([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])"
op = "(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)"
exp = "(\(\s*" .. val .. "\s*" .. op .. "\s*" .. val .. "\s*\))"
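The same idea can be sketched in Python, assuming that is an option in your environment. The names `VAL`, `OP`, `EXPR` and `parse` are illustrative, and the sub-patterns are copied from the question with only the unnecessary escapes dropped.

```python
import re

# Reusable sub-patterns, wrapped in non-capturing groups so they can
# be embedded anywhere without shifting capture-group numbers.
VAL = r"(?:[a-zA-Z]+[0-9]*|[0-9]+|'.*'|\[.*\])"
OP = r"(?:ni|in|\*|/|\+|-|==|!=|>|>=|<|<=)"

# Build the full expression once; the pattern of VAL is reused on
# both sides, so the left and right values may differ.
EXPR = re.compile(rf"\(\s*({VAL})\s*({OP})\s*({VAL})\s*\)")

def parse(text: str):
    """Return (left, operator, right) or None if text is not an expression."""
    m = EXPR.fullmatch(text)
    return m.groups() if m else None
```

For example, `parse("(a == b)")` gives `('a', '==', 'b')`, and `parse("(a = b)")` gives `None` because `=` is not one of the listed operators. Unlike a subroutine call, this reuse happens when the pattern string is built, so it works in engines that have no `(?&Val)` support.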
nolan