Longest Match
Unfortunately, there is no distinct logic to tell a regular expression
engine to get the longest match possible.
Doing so could set off runaway backtracking, because the engine would have to
keep trying alternatives after it already has a match.
That is more complexity than the engine is built to take on.
All regular expressions are processed from left to right.
The engine takes the first thing it can match, then stops.
This is especially true of alternations. Applied to 'this is here', and used on its own,
this|this is|this is here
will always match just the leading 'this' and
will NEVER match this is
nor this is here.
Once you realize that, you can reorder the alternation into
this is here|this is|this
which gives the longest match every time.
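A minimal sketch of the difference, assuming the target string 'this is here' (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = 'this is here';

# Shortest alternative listed first: the engine stops at the first one that succeeds.
my ($first) = $text =~ /(this|this is|this is here)/;

# Longest alternative listed first: the longest one is tried, and taken, first.
my ($longest) = $text =~ /(this is here|this is|this)/;

print "shortest-first order: '$first'\n";    # 'this'
print "longest-first order:  '$longest'\n";  # 'this is here'
```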
Of course this can be reduced to this(?: is(?: here)?)?
which is the clever way of getting the longest match.
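As a quick sanity check, the sketch below compares the reordered alternation with a nested-optional form, this(?: is(?: here)?)?, on a few sample strings. The sample strings are made up for the test; both patterns should pick the same text each time.

```perl
use strict;
use warnings;

my $reordered = qr/this is here|this is|this/;
my $factored  = qr/this(?: is(?: here)?)?/;

my @mismatches;
for my $text ('this', 'this is', 'this is here', 'this here', 'say this is it') {
    my ($r) = $text =~ /($reordered)/;
    my ($f) = $text =~ /($factored)/;
    print "'$text' -> reordered '$r', factored '$f'\n";
    push @mismatches, $text if $r ne $f;
}

print @mismatches ? "Patterns disagree\n" : "Patterns agree on every sample\n";
```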
I haven't seen any examples of the regexes you want to combine,
so this is just some general information.
If you show the regexes you're trying to combine, a better solution could be
provided.
The alternatives inside an alternation do affect each other, and whatever precedes or
follows the group can also have an effect on which alternative gets matched.
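A small illustration of how the surrounding pattern changes the outcome, assuming the same alternation wrapped in ^ and $ anchors. The anchors force the engine to backtrack past the shorter alternatives, because those cannot reach the end of the string.

```perl
use strict;
use warnings;

my $text = 'this is here';

# Nothing after the group: the first alternative that succeeds wins.
my ($loose) = $text =~ /(this|this is|this is here)/;

# A trailing $ rejects 'this' and 'this is', so the engine backtracks
# into the group until an alternative lets the whole pattern succeed.
my ($anchored) = $text =~ /^(this|this is|this is here)$/;

print "unanchored: '$loose'\n";     # 'this'
print "anchored:   '$anchored'\n";  # 'this is here'
```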
If you have more questions just ask.
Addendum:
For @Laurel. This could always be done with a Perl 5 regex (>5.10)
because Perl can run code from within regex sub-expressions.
Since it can run code, it can count and get the longest match.
The rule of leftmost first, however, will never change.
If regex were thermodynamics, this would be the first law.
Perl is a strange entity as it tries to create a synergy between regex
and code execution.
As a result, it is possible to overload its operators, to inject
customization into the language itself.
Its regex engine is no different, and can be customized the same way.
So, in theory, the regex below can be made into a regex construct,
a new Alternation construct.
I won't go into details here, but suffice it to say, it's not for the faint of heart.
If you're interested in this type of thing, see the perlre manpage under
the section 'Creating Custom RE Engines'.
Perl:
Note - The regex alternation form is based on @Laurel's complex example
(a|ab.*c|.{0,2}c*d)
applied to abcccd.
Visually, if made into a custom regex construct, it would look similar to
an alternation, (?:rx1||rx2||rx3),
and I'm guessing this is how a lot of
Perl6 is done in terms of integrating the regex engine directly into the language.
Also, if used as is, it's possible to construct this regex dynamically as needed.
And note that all the richness of Perl regex constructs is available.
Output
Longest Match Found: abcccd
Code
use strict;
use warnings;
my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';
# Formatted using RegexFormat7 (www.regexformat.com)
if ( $targ =~
/
     # Reset the lengths at each new starting position
     (?{ ($p1,$p2,$p3) = (0,0,0) })

     # The Alternation Construct
     (?=
          ( a )                         # (1)
          (?{ $p1 = length($^N) })
     )?
     (?=
          ( ab .* c )                   # (2)
          (?{ $p2 = length($^N) })
     )?
     (?=
          ( .{0,2} c*d )                # (3)
          (?{ $p3 = length($^N) })
     )?

     # Check At Least 1 Match (fail only if no group matched)
     (?(1)
       |  (?(2)
            |  (?(3)
                 |  (?!)
               )
          )
     )

     # Consume Longest Alternation Match
     (                                  # (4 start)
          (?(?{
               $p1>=$p2 && $p1>=$p3
            })
               \1
            |  (?(?{
                    $p2>=$p1 && $p2>=$p3
                 })
                    \2
                 |  (?(?{
                         $p3>=$p1 && $p3>=$p2
                      })
                         \3
                    )
               )
          )
     )                                  # (4 end)
/x ) {
     print "Longest Match Found: $4\n";
} else {
     print "Did not find a match!\n";
}
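For comparison, here is a rough sketch of the same selection done in ordinary Perl code instead of inside the regex. This is a different technique from the construct above: it tries each alternative separately at each starting position and keeps the longest hit, so leftmost still wins over longest. The helper name longest_alternative is invented for this example, and the alternatives are assumed to be passed as pattern strings.

```perl
use strict;
use warnings;

# Returns the longest text matched by any alternative at the leftmost
# position where at least one alternative matches, or nothing if none do.
sub longest_alternative {
    my ($text, @alts) = @_;
    for my $start (0 .. length($text)) {
        my $best;
        for my $alt (@alts) {
            pos($text) = $start;                   # anchor the next match here
            next unless $text =~ /\G($alt)/g;      # \G pins the match to pos()
            $best = $1 if !defined($best) || length($1) > length($best);
        }
        return $best if defined $best;
    }
    return;
}

my $found = longest_alternative('abcccd', 'a', 'ab.*c', '.{0,2}c*d');
print "Longest Match Found: $found\n";
```

Because the alternatives live in a plain list, they can be assembled at run time without touching the regex itself.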