I'm using a commercial closed-source Java application that, among other things, lets me filter text fields by supplying a regex pattern string. I use that filter functionality quite extensively.
The issue I'm having is that I often find myself repeating the exact same subpatterns in the regex. For example, here:
^(
( # pattern foo
foo_([^_]+)_(windows|linux|osx)
)
|
( # pattern bar
([^_]+)_bar_(windows|linux|osx)_foo_(windows|linux|osx)
)
)$
The ([^_]+) and (windows|linux|osx) parts repeat quite often.
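To make the example concrete, here is a minimal sketch of how the pattern above behaves in plain java.util.regex. The class name, the sample inputs, and the leading (?x) flag are my own assumptions: (?x) is what lets the whitespace and # comments live inside the pattern string, and I don't know whether the application applies the pattern with a full match or a search (with ^ and $ in place the difference should not matter here).

```java
import java.util.regex.Pattern;

public class RepeatedSubpatternDemo {
    // (?x) enables comments mode from inside the pattern string itself:
    // whitespace is ignored and '#' starts a comment that runs to end of line.
    static final String FILTER = String.join("\n",
        "(?x)^(",
        "  ( # pattern foo",
        "    foo_([^_]+)_(windows|linux|osx)",
        "  )",
        "  |",
        "  ( # pattern bar",
        "    ([^_]+)_bar_(windows|linux|osx)_foo_(windows|linux|osx)",
        "  )",
        ")$");

    // Hypothetical helper: true if the whole input satisfies the filter.
    static boolean matches(String input) {
        return Pattern.compile(FILTER).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("foo_abc_linux"));           // true
        System.out.println(matches("abc_bar_windows_foo_osx")); // true
        System.out.println(matches("foo_abc_bsd"));             // false
    }
}
```

Note that (windows|linux|osx) appears three times and ([^_]+) twice in the pattern string, which is exactly the duplication I'd like to get rid of.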
That's just a made-up example. The original regex is more complex, about 20 times larger, and has many different repeated parts. It gets harder to read as the repeated subpatterns keep growing in both size and number, and it's troublesome that modifying a repeated subpattern means modifying every copy of it.
So I played with regex101 and came up with this:
^(
( # a dummy option, defines some frequently used capture groups
(?!x)x # never matches, so this alternative and the groups defined in it never match anything
(?'name'[^_]+) # group "name"
(?'os'windows|linux|osx) # group "os"
)
|
( # pattern foo
foo_\g'name'_\g'os'
)
|
( # pattern bar
\g'name'_bar_\g'os'_foo_\g'os'
)
)$
Now all of the subpatterns are named, and whenever I reference a name, it is replaced with the corresponding subpattern (i.e. \g'os' gets replaced by (windows|linux|osx)). The names are a lot shorter than the subpatterns they stand for, they are clear, and a subpattern only has to be modified once for the change to apply everywhere in the regex.
The issue with this improved version is that while it's a valid PCRE (PHP) regex, it's an invalid Java regex. Comments and line breaks in the regex aside, Java doesn't support \g, as stated in the "Comparison to Perl 5" section of the Pattern documentation.
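A small sketch of the incompatibility, assuming the application hands the string to java.util.regex.Pattern more or less unchanged (I can't verify that). The class and method names are made up for illustration. I switched the group definition to Java's (?<os>...) syntax so that only the \g reference is at fault. The second pattern uses \k<os>, the closest thing Java offers, but that is a backreference to the text that was captured, not a re-use of the subpattern.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SubroutineCallCheck {
    // Java-style named group followed by a PCRE-style subroutine call.
    static final String WITH_G = "(?<os>windows|linux|osx)_\\g'os'";
    // Same group followed by a Java named backreference.
    static final String WITH_K = "(?<os>windows|linux|osx)_\\k<os>";

    // Hypothetical helper: reports whether Java accepts the pattern at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Hypothetical helper: full match of the backreference variant.
    static boolean backrefMatches(String input) {
        return Pattern.matches(WITH_K, input);
    }

    public static void main(String[] args) {
        System.out.println(compiles(WITH_G));              // false: \g is rejected
        System.out.println(compiles(WITH_K));              // true
        System.out.println(backrefMatches("linux_linux")); // true: same text repeated
        System.out.println(backrefMatches("linux_osx"));   // false: \k is not a subpattern call
    }
}
```

So \k<os> can't stand in for \g'os' in the bar pattern, where the two operating systems may differ.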
Is there any way I can "factor out" repeated regex subpatterns like that in a Java regex? Keep in mind that all I can do is provide a pattern string; I have no access to the code.