8

I have a expression like c.{0,2}?mand a string like "abcemtcmncefmf". Currently it will matches three substrings: cem, cm and cefm (see here). But I like to match only the smallest of this, in this case, cm.

My problem is that I don't have a global match support, only the first match, because I'm using MariaDB REGEXP_SUBSTR() function. My current solution is a stored procedure that I created to solve my problem. But it is 10 times slower than just a regular expression for simple cases.

I too tried do something like: (cm|c.{0,1}?m|c.{0,2}?m), but it doesn't worked because it will match first of any group patterns, instead of try one by one in all subject string.

I know that regular expressions (PCRE) have some black magic features, but I don't found nothing to solve my problem.


Community
  • 1
  • 1
David Rodrigues
  • 12,041
  • 16
  • 62
  • 90
  • 2
    I think `regex` is the wrong tool for this job really. It's built around matching (and replacing) strings. Not sorting logic and comparisons. – Sobrique Jan 29 '16 at 18:50
  • What is your capturing logic and what is the point of `.{0,2}?` after `m`? – anubhava Jan 29 '16 at 18:50
  • @Sobrique on reality I don't need a *sorting logic*, but if each pattern on group (a|b|c) could be matched one by one (first try a, if fail, then try b, if fail, then try c) on all string (until end for each group item pattern) and return the first one, it should resolve without sorting anything. Thanks. – David Rodrigues Jan 29 '16 at 19:06
  • @anubhava Yeah! After `m` it don't care, on reality. I'll edit it. Thanks. – David Rodrigues Jan 29 '16 at 19:07
  • 2
    But that's the thing - you _do_ need sorting logic to match the pattern multiple times, and pick out the shortest. Your regex engine is working through the string, and cycling through each possible match to see which is valid. So it will only _ever_ give the leftmost string that matches any of the pattern combinations. – Sobrique Jan 29 '16 at 19:12

3 Answers3

4

There's a lot of things that regular expressions can do - some of which are - as you say - 'dark magic'. But the core problem is - pretty fundamentally, regular expressions are about text selection can capture. They don't 'do' match comparison or evaluation - they either match or they do not.

You can see what the regex is doing, by enabling it in debug mode. For this, I'll use perl because you can set use re 'debug';':

#!/usr/bin/env perl

use strict;
use warnings;

use re 'debug';

my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/;
print join "\n", @matches;

This will print what the regex engine is doing as it goes:

Compiling REx "(cm|c.m|c..m)"
Final program:
   1: OPEN1 (3)
   3:   TRIE-EXACT[c] (19)
        <cm> (19)
        <c> (9)
   9:     REG_ANY (10)
  10:     EXACT <m> (19)
        <c> (15)
  15:     REG_ANY (16)
  16:     REG_ANY (17)
  17:     EXACT <m> (19)
  19: CLOSE1 (21)
  21: END (0)
stclass AHOCORASICK-EXACT[c] minlen 1 
Matching REx "(cm|c.m|c..m)" against "abcemtcmncefmf"
Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes)
   0 <> <abcemtcmnc>         | Scanning for legal start char...
   2 <ab> <cemtcmncef>       | Charid:  1 CP:  63 State:    1, word=0 - legal
   3 <abc> <emtcmncefm>      | Charid:  0 CP:  65 State:    2, word=2 - fail
   3 <abc> <emtcmncefm>      | Fail transition to State:    1, word=0 - fail
Matches word #2 at position 2. Trying full pattern...
   2 <ab> <cemtcmncef>       |  1:OPEN1(3)
   2 <ab> <cemtcmncef>       |  3:TRIE-EXACT[c](19)
   2 <ab> <cemtcmncef>       |    State:    1 Accepted: N Charid:  1 CP:  63 After State:    2
   3 <abc> <emtcmncefm>      |    State:    2 Accepted: Y Charid:  0 CP:  65 After State:    0
                                  got 2 possible matches
                                  TRIE matched word #2, continuing
   3 <abc> <emtcmncefm>      |  9:  REG_ANY(10)
   4 <abce> <mtcmncefmf>     | 10:  EXACT <m>(19)
   5 <abcem> <tcmncefmf>     | 19:  CLOSE1(21)
   5 <abcem> <tcmncefmf>     | 21:  END(0)
Match successful!
Freeing REx: "(cm|c.m|c..m)"

Hopefully you can see what it's doing here?

  • working from left to right
  • hits the first 'c'
  • checks to see if 'cm' matches (fails)
  • checks to see if 'c.m' matches (succeeds).
  • bails out here and returns hits.

Turn on g and you get it to work multiple times - I shan't reproduce it, but it's quite a lot longer.

Whilst you can do a lot of clever tricks with PCRE, such as look around, look ahead, greedy/nongreedy matching.... pretty fundamentally, here, you are trying to select multiple valid matches, and pick the shortest. And regex can't do that.

I would offer though - with that same perl, the process of finding the shortest is quite easy:

use List::Util qw/reduce/;
print  reduce { length( $a ) < length( $b ) ? $a : $b } @matches;
Sobrique
  • 52,974
  • 7
  • 60
  • 101
  • It is basically what I did on *stored procedure*, but it is too slow (ten times slower on small texts). On MariaDB I don't have the `g` modifier or similar, so I need execute regular expression *N* times on same string until the end, checking each time if the current match is smaller than previous. I understand that is not possible apply sort, but maybe do [**something like it**](http://pastebin.com/NTYu2HaH). – David Rodrigues Jan 29 '16 at 19:59
4

You can simply use an alternation in a branch reset group:

/^(?|.*(cm)|.*(c.m)|.*(c..m))/s

(The result is in group 1)

or like this:

/^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s

The first successful branch wins.

ikegami
  • 367,544
  • 15
  • 269
  • 518
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • Wow!!! The alternation (`?|`) is exactly what I expected. I know that I saw it in some place before, but I don't found it. Thanks a lot! It'll simplify very much my current code! – David Rodrigues Jan 31 '16 at 04:28
  • I made some changes on your code. I used something like `/(?|\K(cm)|\K(c.m)|\K(c..m))/`. It have two advanges: reduced from 9690 steps to 1220 steps, and the match #0 is the own group #1 (I need it because MariaDB will uses the match #0). My query jumped from 5 seconds to 500ms. :) – David Rodrigues Jan 31 '16 at 05:15
  • @DavidRodrigues: `/(?|\K(cm)|\K(c.m)|\K(c..m))/` that is like `(?:cm|c.m|c..m)` doesn't produce the same result because it will return the first position where one of the branches succeeds. In my example, each branch are tested one by one from the start of the string. Take the time to test each answer to find the one that is the more efficient for your situation. – Casimir et Hippolyte Jan 31 '16 at 12:54
  • @DavidRodrigues: you can try this one too: `/^[^c]*+(?:(?:c(?!m)[^c]*)*+\Kcm|(?:c(?!.m)[^c]*)*+\Kc.m|(?:c(?!..m)[^c]*)*+\Kc..m)/` it's a bit longer but it avoids backtracking. – Casimir et Hippolyte Jan 31 '16 at 13:19
  • You are right. I do some more testing here and seems that your original code works better. I'll use that. Thanks again! – David Rodrigues Feb 01 '16 at 14:09
2

Technically, it can be done.

my ($match) = /
   ^
   (?:(?! c[^m]{0,2}m ).)*+         # Skip past area with no matches.
   (?:
      (?:(?! c[^m]{0,1}m ).)*+      # Skip past area with no matches except longuest.
      (?:
         (?:(?! c[^m]{0,0}m ).)*+   # Skip past area with no matches except 2 longuest.
      )?
   )?
   ( c[^m]{0,2}m )
/xs;

[Note: Removing the possessive quantifier modifiers (+) will greatly affect performance.]

But it's usually far, far better to find all matches and locate the smallest one.

use List::Util qw( reduce );
my ($match) = reduce { length($a) <= length($b) ? $a : $b } /c[^m]{0,2}m/g;
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • Wow!!! I know it! Regular expressions does dark magic! It'll work for me. Now I just need thinks how I can generate it from user input. But for now, it solve my problem. Thanks a lot! – David Rodrigues Jan 29 '16 at 21:04
  • Which adaptations I should do to expand it to three characters long? Example, instead of search by `cm`, search by `cpm`. – David Rodrigues Jan 29 '16 at 21:08
  • But and about the `[^m]`, it should be changed too, right? Should be something like `c[^p]{0,2}p[^m]{0,2}m`? – David Rodrigues Jan 29 '16 at 21:13
  • Yes, I mean, following your code. Instead of I use `YourDarkMagic("cm")`, I need use `YourDarkMagic("cpm")` to allow match `cpm`, `cepem`, ... – David Rodrigues Jan 29 '16 at 21:14
  • 1
    The fact that you don't know shows why it's a bad idea. The contents of the first `(?!...)` should match all patterns you want to match. The contents of the second `(?!...)` should match all patterns you want to match except the longest. The contents of the third `(?!...)` should match all patterns you want to match except the longest and the second longest. etc. The final bit in `(...)` should match all the patterns you want to match (like the contents of the first `(?!...)`). – ikegami Jan 29 '16 at 21:17
  • I'll study your code first. But I imagine that should be like my comment above. I need create some tests. I am asking just because, if it is a simple change, is easy to reply. Else, I'll work hard to understand and expand it to more characters. But in all way, thanks! You help me a lot. -- I tried here, is just expand `[^m]{0,2}m` to each character that I need. – David Rodrigues Jan 29 '16 at 21:20
  • `c[^p]{0,2}p[^m]{0,2}m` will appear twice, but that's the easy part. What's the pattern that matches all but the longest? It's NOT `c[^p]{0,1}p[^m]{0,1}m`. It's `c[^p]{0,1}(?:p[^m]|[^p]p)[^m]{0,1}m` – ikegami Jan 29 '16 at 21:23
  • Oh, yeah. Now I'm confused. It is better I first use the stored procedure, for now, until I be capable to understand how this expression works correctly. – David Rodrigues Jan 29 '16 at 21:28
  • The max length of `c[^p]{0,2}p[^m]{0,2}m` is 7. The max length of `c[^p]{0,1}p[^m]{0,1}m` is 5. What happened to length 6 (e.g. `cxxpxm` and `cxxpxm`)? – ikegami Jan 29 '16 at 21:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102037/discussion-between-david-rodrigues-and-ikegami). – David Rodrigues Jan 29 '16 at 21:32