13

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.

I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?

user200783
  • 13,722
  • 12
  • 69
  • 135
  • 2
    If these are all literal strings, you might want to sort the item array alphabetically and by the string length (descending), then join the items with `|`. – Wiktor Stribiżew Mar 14 '16 at 20:29
  • @WiktorStribiżew: Thanks - unfortunately the `regexI` are not all literal strings. – user200783 Mar 14 '16 at 20:33
  • Hi @PaulBaker , Can you show us an example, or create a http://regex101.com Where we can try out a few things. It sounds like you are looking for something I also needed a while ago, using REGEX IF/THEN/ELSE. Also what language would you eventually use this in? since your question might require comparisions executed outside REGEX et al. – TolMera Mar 30 '16 at 08:24

3 Answers3

12

Use the right regex flavor!

In some regex flavors, the alternation providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus lack some modern constructs such as back references.

Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternator that short-circuits to first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.

Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See my section Relevant info about regex engine design section for more information about DFAs.)

This means, you can have ANY order of statements in a pipe and the result will always be the longest.

(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).

According to this page*:

The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.


* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.

Let me give you one specific example that I have verified on my own computer:

echo "ab c a" | sed -E 's/(a|ab)/replacement/'

The regex is (a|ab). When it runs on the string ab c a you get : replacement c a, meaning that you do, in fact, get the longest match that the alternator can provide.

This regex, for a more complex example, (a|ab.*c|.{0,2}c*d) applied to abcccd, will return abcccd.

Try it here!

More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alterations to see if another one will match a longer string (from the position where the initial match starts).

In other words, no matter the order of choices in an alteration, POSIX compliant regexes use the one that matches the most characters.


Other examples of flavors with this behavior:

  • Tcl ARE
  • POSIX ERE
  • GNU BRE
  • GNU ERE

Relevant information about regex engine design

This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page. It explains how alternatives can be combined into a single path: Thompson algorithm for alternation]]


Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.

Community
  • 1
  • 1
Laurel
  • 5,965
  • 14
  • 31
  • 57
  • 1
    You might be confused about longest match. The primary function of all regex engines is to get the _leftmost_ match, be it longest or not. It is true POSIX dictates longest match, but that doesn't override leftmost priority. Longest match is a holdover to the time when there was no concept of greedy/non-greedy, it was all greedy. To that extent, alternation construct followed this paradigm. To make the longest match work, it can't have a backtracking mechanism, however it is very slow trying all the permutations. –  Mar 30 '16 at 15:24
  • As an example of leftmost priority, `(a|b.*)` will not match "a`b_____`" but instead matches "`a`b_____", which is the shortest match but the _leftmost_ match. –  Mar 30 '16 at 15:29
  • `A POSIX-compliant engine will still find the leftmost match. If you apply Set|SetValue to Set or SetValue once, it will match Set. The first position in the string is the leftmost position where our regex can find a valid match. The fact that a longer match can be found further in the string is irrelevant.` If this weren't true of _all_ regex engines, the language would be useless. –  Mar 30 '16 at 15:37
  • 1
    @sin `If you apply Set|SetValue to Set or SetValue once, it will match Set` is **wrong** (in the languages I'm talking about). Your other point, about `(a|b.*)` not matching things does not apply to the question at hand here: `knowing if regexK can match a prefix of regexL`. – Laurel Mar 30 '16 at 22:37
  • @Lurel - The question at hand is not if a prefix of two regexes matches here since he states it's not all literals, but is all alternations. Unless its all wildcard matching `.+` or all optional where anything can match anything, it means there will be incongruent distinction. I.e. at some point the alternation items becomes divergent. So the notion of prefixes becomes a simple common factored out issue, but divergent after that. Either way, the examples you show are irrelavent in that he has a mix of literal and wild types. Besides, POSIX doesn't present any rich constructs, simple only. –  Mar 30 '16 at 22:57
  • And it's always the _leftmost_ match that has highest precedence. Means the shortest match could occur first or not, but it won't be governed by an alternation construct in POSIX unless there is a true prefix within the alternation _AND_ the most important thing, if it is matches at the _current_ (leftmost) character position. The dog wagging the tail is the notion of _leftmost_ not longest. –  Mar 30 '16 at 23:06
  • 1
    @sln I updated my post with a more complex example. AND I updated things for clarity. I understand fully that regexes work leftward through the string to find matches, but my point is that OP asked for **matches at the beginning of the string**, which may mean that they have a regex like this: `^(regex1|regex2|...)` – Laurel Mar 31 '16 at 15:47
  • I don't think the op means to use the BOS anchor. But if used, it greatly restricts the alternations. Also, your more complex example is just as I described, filled with wild cards and common characters. In the real world, this just won't happen. The combination of distinction in alternation construct parts and the actual layout of the text will almost certainly guarantee that getting the longest match length will be by accident and have no meaning whatsoever. In that regard, POSIX is irrelevant, and oh so slow and crippled. But, still the OP hasn't shown any regex sample's. –  Mar 31 '16 at 16:15
  • @Laurel: You mention *these regex flavors are old* - Perl6 is not old, and has a support for an alternation with longest match preference (see [*Longest Alternation*](https://doc.perl6.org/language/regexes#Longest_Alternation)). – Wiktor Stribiżew Apr 01 '16 at 08:30
  • @WiktorStribiżew Fixed it, I was thinking that Perl probably had support for this, but I don't know Perl. I might add more info later from your link. – Laurel Apr 01 '16 at 13:34
  • @Laurel: Not Perl, Perl6. These are supposed to be different "flavors". – Wiktor Stribiżew Apr 01 '16 at 13:35
  • @WiktorStribiżew I don't know any Perl, so I use `Perl` to mean any/all of the Perl versions. I checked, I have `perl 5, version 12, subversion 3 (v5.12.3)` installed. So close, yet not Perl6. I assume there's a PCRE that has the same setting available, right? – Laurel Apr 01 '16 at 13:45
  • @Laurel: Perl 6 is a ground-up rewrite of Perl 5, and the regex flavor is very different. PCRE is based on Perl 5, as are all other "Perl-compatible" or "Perl-derivative" flavors. If any of them have incorporated longest-token matching, I haven't heard about it. – Alan Moore Apr 01 '16 at 14:05
  • @Laurel - I updated to show how this can be done in Perl 5 (>.10) series. Perl is a different animal than PCRE in its integration of regex code constructs `(?{})` and `(?{{}})`. –  Apr 01 '16 at 17:44
  • 1
    PCRE has a DFA mode which essentially always finds the longest match (it's input-driven, not pattern-driven). Same for RE2 etc. You give up sone regex features for this to be possible though. – Lucas Trzesniewski Apr 01 '16 at 21:47
  • @LucasTrzesniewski I searched, and I think you're talking about:`pcre_dfa_exec(), pcre16_dfa_exec() and pcre32_dfa_exec()`, right? – Laurel Apr 01 '16 at 21:55
  • @Laurel right. Nowadays that would be `pcre2_dfa_match` though, (use PCRE2, it's the newer version). See [here](http://www.pcre.org/current/doc/html/pcre2matching.html). – Lucas Trzesniewski Apr 01 '16 at 21:56
  • I looked at that pcre2 page. The `pcre2_dfa_match()` explanation is convoluted. Says it finds all paths from the beginning of the target string to the end, even though it returns the leftmost longest match. A supreme waste of time and prohibitively slow on open ended repetition. Typical dfa. Have to design a regex with extreme caution. The richness of pcre is dumbed out, no captures, no backreferences, no real useable conditionals. Assertions are there, but who knows what they produce. This is not a model of a modern day regex engine. –  Apr 02 '16 at 13:33
7

Longest Match

Unfortunately, there is no distinct logic to tell a regular expression
engine to get the longest match possible.

Doing so would/could create a cascading backtracking episode gone wild.
It is, by definition a complexity too great to deal with.

All regular expressions are processed from left to right.
Anything the engine can match first it will, then bail out.

This is especially true of alternations, where this|this is|this is here
will always match 'this is here' first and
will NEVER ever match this is nor this is here

Once you realize that, you can reorder the alternation into
this is here|this is|this which gives the longest match every time.

Of course this can be reduced to this(?:(?: is)? here)?
which is the clever way of getting the longest match.

Haven't seen any examples of the regex's you want to combine,
so this is just some general information.

If you show the regexes you're trying to combine, better solution could be
provided.

Alternation contents do affect each other, as well as whatever precedes or
follows the cluster can have an affect on which alternation gets matched.

If you have more questions just ask.


Addendum:

For @Laurel. This could always be done with a Perl 5 regex (>5.10)
because Perl can run code from within regex sub-expressions.
Since it can run code, it can count and get the longest match.

The rule of leftmost first, however, will never change.
If regex were thermodynamics, this would be the first law.

Perl is a strange entity as it tries to create a synergy between regex
and code execution.

As a result, it is possible to overload it's operators, to inject
customization into the language itself.
Their regex engine is no different, and can be customized the same way.

So, in theory, the regex below can be made into a regex construct,
a new Alternation construct.

I won't go into detail's here, but suffice it to say, it's not for the faint at heart.
If you're interested in this type of thing, see the perlre manpage under
section 'Creating Custom RE Engines'

Perl:

Note - The regex alternation form is based on @Laurel complex example
(a|ab.*c|.{0,2}c*d) applied to abcccd.

Visually, if made into a custom regex construct, would look similar to
an alternation (?:rx1||rx2||rx3) and I'm guessing this is how a lot of
Perl6 is done in terms of integrating regex engine directly into the language.

Also, if used as is, it's possible to construct this regex dynamically as needed.
And note that all the richness of Perl regex constructs are available.

Output

Longest Match Found:  abcccd

Code

use strict;
use warnings;

my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';

# Formatted using RegexFormat7 (www.regexformat.com)

if ( $targ =~
/
   # The Alternation Construct
     (?=
          ( a )                         # (1)
          (?{ $p1 = length($^N) })
     )?
     (?=
          ( ab .* c )                   # (2)
          (?{ $p2 = length($^N) })
     )?
     (?=
          ( .{0,2} c*d )                # (3)
          (?{ $p3 = length($^N) })
     )?
   # Check At Least 1 Match
     (?(1)
          (?(2)
               (?(3)
                 |  (?!)
               )
          )
     )
   # Consume Longest Alternation Match
     (                                  # (4 start)
          (?(?{
               $p1>=$p2 && $p1>=$p3
            })
               \1 
            |  (?(?{
                    $p2>=$p1 && $p2>=$p3
                 })
                    \2 
                 |  (?(?{
                         $p3>=$p1 && $p3>=$p2
                      })
                         \3 
                    )
               )
          )
     )                                  # (4 end)
/x ) {

    print "Longest Match Found:  $4\n";
} else {
    print "Did not find a match!\n";
}
2

For sure a human might be able judging whther two given regexp are matching prefixes for some cases. In general this is an n-p-complete problem. So don't try.

In the best case combining the different regexp into a single one will give a suitable result cheap. However, I'm not aware of any algorithm that can take two arbitrary regexp and combine them in a way that the resulting regexp is still matching what any of the two would match. It would be n-p-complete also.

You must also not rely on ordering of alternatives. This depends on the internal execution logic of the regexp engine. It could easily be that this is reordering the alternatives internally beyond your control. So, a valid ordering with current engine mmight give wrong results with a different engine. (So, it could help as long as you stay with a single regexp engine implementation)

Best approach seems to me to simply execute all regexp, keep track of the matched length and then take the longest match.

rpy
  • 3,953
  • 2
  • 20
  • 31