1

There are quite a few questions on SO asking about how to parse a regex pattern and output all possible matches to that pattern. For some reason, though, every single one of them I can find (1, 2, 3, 4, 5, 6, 7, probably more) are either for Java or some variety of C (and just one for JavaScript), and I currently need to do this in PHP.

I’ve Googled to my heart’s (dis)content, but whatever I do, pretty much the only thing that Google gives me is links to the docs for preg_match() and pages about how to use regex, which is the opposite of what I want here.

My regex patterns are all very simple and guaranteed to be finite; the only syntax used is:

  • [] for character classes
  • () for subgroups (capturing not required)
  • | (pipe) for alternative matches within subgroups
  • ? for zero-or-one matches

So an example might be [ct]hun(k|der)(s|ed|ing)? to match all possible forms of the verbs chunk, thunk, chunder and thunder, for a total of sixteen permutations.

Ideally, there’d be a library or tool for PHP which will iterate through (finite) regex patterns and output all possible matches, all ready to go. Does anyone know if such a library/tool already exists?

If not, what is an optimised way to approach making one? This answer for JavaScript is the closest I’ve been able to find to something I should be able to adapt, but unfortunately I just can’t wrap my head around how it actually works, which makes adapting it more tricky. Plus there may well be better ways of doing it in PHP anyway. Some logical pointers as to how the task would best be broken down would be greatly appreciated.

Edit: Since apparently it wasn’t clear how this would look in practice, I am looking for something that will allow this type of input:

$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');

– and printing $possibleMatches should then give something like this (the order of the elements is not important in my case):

Array
(
    [0] => chunk
    [1] => thunk
    [2] => chunks
    [3] => thunks
    [4] => chunked
    [5] => thunked
    [6] => chunking
    [7] => thunking
    [8] => chunder
    [9] => thunder
    [10] => chunders
    [11] => thunders
    [12] => chundered
    [13] => thundered
    [14] => chundering
    [15] => thundering
)
Janus Bahs Jacquet
  • 859
  • 1
  • 11
  • 27
  • So, you don't want to match the pattern, you want to see what "words" the pattern would match? – Steven Mar 13 '21 at 11:56
  • @Steven Yes, exactly. I basically want to take a pattern and turn it into a non-regex list of all the strings that pattern could potentially match. – Janus Bahs Jacquet Mar 13 '21 at 11:57
  • Okay, presumably this only applies to regex with limited character classes? Obviously anything with things like `.` or quantifiers (e.g. `+`) could mean rather lengthy lists! – Steven Mar 13 '21 at 11:58
  • @Steven Yes, that’s why I specify that my patterns are all finite (i.e., they don’t contain anything that could make the list of matched strings infinite) – no regex syntax is used at all apart from the four listed in the question. – Janus Bahs Jacquet Mar 13 '21 at 11:59
  • Okay, well if that's the case you could simply strip out the _variable groups_, split them up as needed (by letter for character classes and by word for capture groups), and then use recursion to work your way through at each level? – Steven Mar 13 '21 at 13:06
  • Here’s the [JS version converted to PHP](https://3v4l.org/EFeXi). It doesn’t support `[]` or `?` unfortunately. (I converted on my iPad so it might be a bit ugly, sorry.) – Chris Haas Mar 13 '21 at 13:13
  • @Steven That’s pretty much what my rough mind plan was as well. The bit that I can’t work out in my mind is how exactly to strip out the possibly nested subgroups from the string to begin with. I just realised that PHP can do [recursive regex patterns](https://www.php.net/manual/en/regexp.reference.recursive.php), which might do it. – Janus Bahs Jacquet Mar 13 '21 at 13:18
  • `preg_match_all` should do the trick? – Steven Mar 13 '21 at 13:22
  • @Chris Thanks, that’s a nice shortcut! Character classes should be easy enough since they’re not nestable, and `?` isn’t a problem either (as far as I can think). Was that converted by some fancy tool, or did you just go through and convert manually? – Janus Bahs Jacquet Mar 13 '21 at 13:22

1 Answers1

1

Method

  1. You need to strip out the variable patterns; you can use preg_match_all to do this

    preg_match_all("/(\[\w+\]|\([\w|]+\))/", '[ct]hun(k|der)(s|ed|ing)?', $matches);
    
    /* Regex:
    
    /(\[\w+\]|\([\w|]+\))/
    /                       : Pattern delimiter
     (                      : Start of capture group
      \[\w+\]               : Character class pattern
             |              : OR operator
              \([\w|]+\)    : Capture group pattern
                        )   : End of capture group
                         /  : Pattern delimiter
    
    */
    
  2. You can then expand the capture groups to letters or words (depending on type)

    $array = str_split($cleanString, 1); // For a character class
    $array = explode("|", $cleanString); // For a capture group
    
  3. Recursively work your way through each $array

Code

function printMatches($pattern, $array, $matchPattern)
{
    $currentArray = array_shift($array);

    foreach ($currentArray as $option) {
        $patternModified = preg_replace($matchPattern, $option, $pattern, 1);
        if (!count($array)) {
            echo $patternModified, PHP_EOL;
        } else {
            printMatches($patternModified, $array, $matchPattern);
        }
    }
}

function prepOptions($matches)
{
    foreach ($matches as $match) {
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);
        }
        if ($match[-1] === "?") {
            $array[] = "";
        }
        $possibilites[] = $array;
    }
    return $possibilites;
}

$regex        = '[ct]hun(k|der)(s|ed|ing)?';
$matchPattern = "/(\[\w+\]|\([\w|]+\))\??/";

preg_match_all($matchPattern, $regex, $matches);

printMatches(
    $regex,
    prepOptions($matches[0]),
    $matchPattern
);

Additional functionality

Expanding nested groups

In use you would put this before the "preg_match_all".

$regex        = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';

echo preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
    $output = explode("|", $array[3]);
    if ($array[0][-1] === "?") {
        $output[] = "";
    }
    foreach ($output as &$option) {
        $option = $array[2] . $option;
    }
    return $array[1] . implode("|", $output);
}, $regex), PHP_EOL;

Output:

This happen(s|ed) to (become|be|have|having) test case 1?

Matching single letters

The bones of this would be to update the regex:

$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";

and add an else to the prepOptions function:

} else {
    $array = [$cleanString];
}

Full working example

function printMatches($pattern, $array, $matchPattern)
{
    $currentArray = array_shift($array);

    foreach ($currentArray as $option) {
        $patternModified = preg_replace($matchPattern, $option, $pattern, 1);
        if (!count($array)) {
            echo $patternModified, PHP_EOL;
        } else {
            printMatches($patternModified, $array, $matchPattern);
        }
    }
}

function prepOptions($matches)
{
    foreach ($matches as $match) {
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);
        } else {
            $array = [$cleanString];
        }
        if ($match[-1] === "?") {
            $array[] = "";
        }
        $possibilites[] = $array;
    }
    return $possibilites;
}

$regex        = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";

$regex = preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
    $output = explode("|", $array[3]);
    if ($array[0][-1] === "?") {
        $output[] = "";
    }
    foreach ($output as &$option) {
        $option = $array[2] . $option;
    }
    return $array[1] . implode("|", $output);
}, $regex);


preg_match_all($matchPattern, $regex, $matches);

printMatches(
    $regex,
    prepOptions($matches[0]),
    $matchPattern
);

Output:

This happens to become test case 1
This happens to become test case 
This happens to be test case 1
This happens to be test case 
This happens to have test case 1
This happens to have test case 
This happens to having test case 1
This happens to having test case 
This happened to become test case 1
This happened to become test case 
This happened to be test case 1
This happened to be test case 
This happened to have test case 1
This happened to have test case 
This happened to having test case 1
This happened to having test case 
Steven
  • 6,053
  • 2
  • 16
  • 28
  • Updated to include the `?` functionality. – Steven Mar 13 '21 at 13:54
  • This is quite ingenious… I need to read through it more carefully to get the inner workings mapped out properly in my head, but I get the gist of the flow. Unfortunately, though, it won’t work with nested subgroups, which is also the part that I’m having most trouble figuring out. I’ve tried a few permutations of the match pattern (like `/(\[\w+\]|\(((?>[^()]+)|(?R))\))\??/`), which haven’t worked. I wonder if it’s even possible to get it all in one `preg_match_all()` call. To give an actual example of a pattern with nested submatches, I have `andagts(bog(en)?|bøger(ne)?)s?`. – Janus Bahs Jacquet Mar 13 '21 at 14:01
  • That is true, for infinite recursion at least. In practice, the recursion will never be more than three levels at the most, so still finite, though still more complex than without it. I do have control over the source, so perhaps I should just go through that and avoid nested subgroups. There aren’t _that_ many of them anyway, and they can easily be turned into non-nested alternatives… – Janus Bahs Jacquet Mar 13 '21 at 14:21
  • You could always expand the nested groups? Something like: `echo preg_replace( "/(\(|\|)(\w+)\((\w+)\)\?/", "$1$2|$2$3", $pattern );` (this snippet also assumes that the nested group is always in the form `(...)?`) – Steven Mar 13 '21 at 15:18
  • I ended up just removing the nested groups – there were only about a dozen of them in total (out of about 2,000 patterns). While this method doesn’t expand `?` to variants with and without the preceding entity, and there are some cases where it doesn’t expand as expected, it worked well enough for my purposes. The goal is to make a query list file for an index in an InDesign document generated with the plugin IndexMatic. IndexMatic does support regex, but it was choking on my query list file, probably because there was _too much_ regex, and too complex. But the output of this method works. – Janus Bahs Jacquet Mar 14 '21 at 10:18
  • @JanusBahsJacquet Nested groups are tricky with regex; however, for simple nests you could use `preg_replace(_callback)?`. I've added an example to the answer. Adding support to handle a letter followed by a question should be quite simple - I will add an example of that too! – Steven Mar 14 '21 at 12:14