There are a lot of things that regular expressions can do, some of which are, as you say, 'dark magic'. But the core problem is that, pretty fundamentally, regular expressions are about text selection and capture. They don't 'do' comparison or evaluation of matches: they either match or they do not.
You can see what the regex is doing by enabling debug mode. For this, I'll use perl, because you can set use re 'debug';:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/;
print join "\n", @matches;
This will print what the regex engine is doing as it goes:
Compiling REx "(cm|c.m|c..m)"
Final program:
1: OPEN1 (3)
3: TRIE-EXACT[c] (19)
<cm> (19)
<c> (9)
9: REG_ANY (10)
10: EXACT <m> (19)
<c> (15)
15: REG_ANY (16)
16: REG_ANY (17)
17: EXACT <m> (19)
19: CLOSE1 (21)
21: END (0)
stclass AHOCORASICK-EXACT[c] minlen 1
Matching REx "(cm|c.m|c..m)" against "abcemtcmncefmf"
Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes)
0 <> <abcemtcmnc> | Scanning for legal start char...
2 <ab> <cemtcmncef> | Charid: 1 CP: 63 State: 1, word=0 - legal
3 <abc> <emtcmncefm> | Charid: 0 CP: 65 State: 2, word=2 - fail
3 <abc> <emtcmncefm> | Fail transition to State: 1, word=0 - fail
Matches word #2 at position 2. Trying full pattern...
2 <ab> <cemtcmncef> | 1:OPEN1(3)
2 <ab> <cemtcmncef> | 3:TRIE-EXACT[c](19)
2 <ab> <cemtcmncef> | State: 1 Accepted: N Charid: 1 CP: 63 After State: 2
3 <abc> <emtcmncefm> | State: 2 Accepted: Y Charid: 0 CP: 65 After State: 0
got 2 possible matches
TRIE matched word #2, continuing
3 <abc> <emtcmncefm> | 9: REG_ANY(10)
4 <abce> <mtcmncefmf> | 10: EXACT <m>(19)
5 <abcem> <tcmncefmf> | 19: CLOSE1(21)
5 <abcem> <tcmncefmf> | 21: END(0)
Match successful!
Freeing REx: "(cm|c.m|c..m)"
Hopefully you can see what it's doing here?
- working from left to right
- hits the first 'c'
- checks to see if 'cm' matches (fails)
- checks to see if 'c.m' matches (succeeds).
- bails out here and returns hits.
Turn on the g flag and it will keep going and match multiple times. I shan't reproduce the debug output, but it's quite a lot longer.
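As a rough sketch of what that looks like with the debug output left off (same string and pattern as above, nothing else assumed): in list context, g returns every non-overlapping match, scanning left to right.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# With /g in list context, the match returns the captured text of every
# non-overlapping match, not just the first one.
my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/g;

# Prints cem, cm and cefm, one per line.
print join( "\n", @matches ), "\n";
```

Note that each match is still 'first alternative that works at the leftmost position', so the results come back in string order, not in order of length.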
Whilst you can do a lot of clever tricks with PCRE, such as lookaround, lookahead, and greedy/non-greedy matching, pretty fundamentally, here, you are trying to select multiple valid matches and pick the shortest. And regex can't do that.
I would offer though that, with that same perl (and the g flag turned on, so that @matches holds every match rather than just the first), the process of finding the shortest is quite easy:
use List::Util qw/reduce/;
print reduce { length( $a ) < length( $b ) ? $a : $b } @matches;
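Putting the pieces together, a minimal self-contained sketch might look like the following. The helper name shortest_match is made up for illustration and isn't part of any module; it also guards against the no-match case, where reduce would otherwise return undef, and it keeps the leftmost match when two are equally short (an arbitrary choice on my part).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw/reduce/;

# Hypothetical helper: collect every non-overlapping match of $regex in
# $string, then return the shortest one (leftmost wins on a tie).
sub shortest_match {
    my ( $string, $regex ) = @_;

    my @matches;
    while ( $string =~ m/$regex/g ) {
        push @matches, $&;    # $& is the text of the most recent match
    }

    return unless @matches;    # nothing matched at all
    return reduce { length($a) <= length($b) ? $a : $b } @matches;
}

my $shortest = shortest_match( "abcemtcmncefmf", qr/cm|c.m|c..m/ );
print defined $shortest ? "$shortest\n" : "no match\n";    # prints "cm"
```

The regex still only does the selecting; the 'which one is shortest' decision happens in ordinary code afterwards.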