4

i'll start with an example because i'm not sure i can explain this properly.

the easier part of the problem (i think although i can't get it either):

  • take some strings eg. 'Example', 'DOMNode', 'DOMText', 'DOMElement'
  • and output '(Example|DOM(Node|Text|Element))'

a more complicated part of the problem is matching from both ends of the strings

  • take some strings eg. 'Example', 'ArrayIterator', 'RecursiveArrayIterator', 'DirectoryIterator', 'RecursiveDirectoryIterator'

  • and output '(Example|(Recursive)?(Array|Directory)Iterator)'

  • i have a list of strings (patterns) to match against a subject.

  • i could simply concatenate the patterns with alternation (which is my current system) but i want to find a way to group common prefixes into alternation groups.

it's not really that much of an optimization but i've been trying to do it as an exercise for fun and now it's just giving me a headache haha.

i tried breaking each string by letter, matching every possible combination.

can't remember all the things i've tried a.t.m. i'm burning both ends of the candle.

i couldn't find a way to get common prefixes or store them so i could rebuild them into a regex. seems like a simple problem but i'm stuck.

i have this function for separating strings with underscores: (works good when you know how to separate prefixes ie by underscores)

<?php
/**
 * separates snake case names into nested hierarchies.
 */
function processArray(array $funcs): array
 {
    $loop = false;
    $current = false;
    $newFuncs = [];
    foreach ($funcs as $name)
     {
        $pos = strpos($name, '_');
        if ($current and !str_starts_with($name, $current))
         {
            if ($loop || $pos)
             {
                $newFuncs[$current] = processArray($newFuncs[$current]);
                $loop = false;
             }
            $current = false;
         }
        if ($pos)
         {
            $current = substr($name, 0, $pos + 1);
            $newFuncs[$current] ??= [];
            $subName = substr($name, $pos + 1);
            $newFuncs[$current][] = $subName;
            if (strpos($subName, '_'))
             {
                $loop = true;
             }
         }
        else
         {
            if ($loop)
             {
                $newFuncs[$current] = processArray($newFuncs[$current]);
                $loop = false;
             }
            $current = false;
            $newFuncs[] = $name;
         }
     }
    return $newFuncs;
 }

function getRegex(array $strs): string
 {
    static $level = 0;
    $ret = '(';
    foreach ($strs as $key => $value)
     {
        if (is_array($value))
         {
            $strs[$key] = (is_string($key)?$key:'').getRegex($value);
         }
     }
    $ret.= implode('|', $strs);
    $ret.= ')';
    return $ret;
 }


$funcs = get_defined_functions()['internal'];
sort($funcs);

$funcs = processArray($funcs);
$getRegex = getRegex($funcs);

//remove isolated groups (groups with only one alternation)
do
 {
    $getRegex = preg_replace('~\(([a-zA-Z_0-9]+?)\)~', '$1', $getRegex, -1, $count);
 } 
while ($count);

var_dump($getRegex);

UPDATE

so I nearly solved it except now where an optional part is present, it adds an empty alternation before the optional string which I think will match first and never match the optional part eg My(|Opt) will not match MyOpt.

here's what i did:


function processRegexArray(array $funcs): array
 {
    $loop = false;
    $current = false;
    $newFuncs = [];
    foreach ($funcs as $name)
     {
        preg_match('~^([A-Z]+(?=[A-Z])|[A-Z0-9_]*[a-z]+)~', $name, $matches);
        $pos = strlen($matches[0]??='') -1;
        if ($current and !str_starts_with($name, $current))
         {
            if ($loop || $pos)
             {
                $newFuncs[$current] = processRegexArray($newFuncs[$current]);
                $loop = false;
             }
            $current = false;
         }
        if ($pos)
         {
            $current = substr($name, 0, $pos + 1);
            $newFuncs[$current] ??= [];
            $subName = substr($name, $pos + 1);
            $newFuncs[$current][] = $subName;
            preg_match('~^([A-Z]+(?=[A-Z])|[A-Z0-9_]*[a-z]+)~', $subName, $matches);
            if ($matches)
             {
                $loop = true;
             }
         }
        else
         {
            if ($loop)
             {
                $newFuncs[$current] = processRegexArray($newFuncs[$current]);
                $loop = false;
             }
            $current = false;
            $newFuncs[] = $name;
         }
     }
    return $newFuncs;
 }

function getRegex(array $strs): string
 {
    if (count($strs) === 0)
        return '';
    static $level = 0;
    $ret = '(';
    foreach ($strs as $key => $value)
     {
        if (is_array($value))
         {
            $strs[$key] = (is_string($key)?$key:'').getRegex($value);
         }
     }
    $ret.= implode('|', $strs);
    $ret.= ')';
    return $ret;
 }

$members = [...get_declared_classes(), ...get_declared_interfaces(), ...get_declared_traits()];
sort($members);

// remove namespaced names for now
foreach ($members as $key => $value)
 {
    if (strpos($value, '\\') !== false)
     {
        unset($members[$key]);
     }
 }
$members = processRegexArray($members);

$getRegex = getRegex($members);
do
 {
    $getRegex = preg_replace('~\(([^|()]*?)\)~', '$1', $getRegex, -1, $count);
 } 
while ($count);
var_dump($getRegex);

should output:

string(2302) "(AllowDynamicProperties|AppendIterator|ArgumentCountError|ArithmeticError|Array(Access|Iterator|Object)|AssertionError|Attribute|BackedEnum|Bad(FunctionCallException|MethodCallException)|CURL(File|StringFile)|CachingIterator|CallbackFilterIterator|ClosedGeneratorException|Closure|CompileError|Countable|Curl(Handle|MultiHandle|ShareHandle)|DOM(Attr|CdataSection|CharacterData|ChildNode|Comment|Document|DocumentFragment|DocumentType|Element|Entity|EntityReference|Exception|Implementation|NameSpaceNode|NamedNodeMap|Node|NodeList|Notation|ParentNode|ProcessingInstruction|Text)|DOMXPath|Date(Interval|Period|Time(|Immutable|Interface|Zone))|DeflateContext|Directory(|Iterator)|DivisionByZeroError|DomainException|EmptyIterator|Error(|Exception)|Exception|Fiber(|Error)|FilesystemIterator|FilterIterator|Generator|GlobIterator|HashContext|InfiniteIterator|InflateContext|InternalIterator|InvalidArgumentException|Iterator(|Aggregate|Iterator)|Json(Exception|Serializable)|LengthException|LibXMLError|LimitIterator|LogicException|MultipleIterator|NoRewindIterator|Open(SSL(AsymmetricKey|Certificate|CertificateSigningRequest))|Out(OfBoundsException|OfRangeException)|OuterIterator|OverflowException|PDO|PDO(Exception|Row|Statement)|ParentIterator|ParseError|Phar(|Data|Exception|FileInfo)|PhpToken|RangeException|Rar(Archive|Entry|Exception)|Recursive(ArrayIterator|CachingIterator|CallbackFilterIterator|DirectoryIterator|FilterIterator|Iterator(|Iterator)|RegexIterator|TreeIterator)|Reflection(|Attribute|Class(|Constant)|Enum(|BackedCase|UnitCase)|Exception|Extension|Fiber|Function(|Abstract)|Generator|IntersectionType|Method|NamedType|Object|Parameter|Property|Reference|Type|UnionType|ZendExtension)|Reflector|RegexIterator|ReturnTypeWillChange|RuntimeException|SeekableIterator|Sensitive(Parameter(|Value))|Serializable|Session(Handler(|Interface)|IdInterface|UpdateTimestampHandlerInterface)|Simple(XML(Element|Iterator))|Spl(DoublyLinkedList|File(Info|Object)|FixedArray|Heap|MaxHeap|MinHeap|ObjectStorage|Observer|PriorityQueue|Queue|Stack|Subject|TempFileObject)|Stringable|Throwable|Traversable|TypeError|UnderflowException|UnexpectedValueException|UnhandledMatchError|UnitEnum|ValueError|Weak(Map|Reference)|XML(Parser|Reader|Writer)|__PHP_Incomplete_Class|finfo|php_user_filter|stdClass)"
yarns
  • 66
  • 8
  • 1
    AFAIK, there is no obvious way to "optimize" regex programmatically. Also, in most cases it is not advisable, because such "optimization" will most likely make understanding of your expression harder. – markalex Mar 12 '23 at 12:26
  • That being said, for your exact case I believe you could start with your initial expression (like `(Example|ArrayIterator|RecursiveArrayIterator|DirectoryIterator|RecursiveDirectoryIterator)` and iteratively shorten it to something like `(Example|(Array|RecursiveArray|Directory|RecursiveDirectory)Iterator)` and than recursively keep shortening inner groups. Since you described your question as exercise, will it be helpful if I provide you with some raw ideas in python? (My php is not sufficient for this) – markalex Mar 12 '23 at 12:33
  • my python iz not very good but anything would help. i tried to break the words up, then get stuck on putting matches into some nested tree perhaps before output. @markalex also the script and raw list of files will be somewhere near the compiled regex so understanding it wont be necessary. – yarns Mar 18 '23 at 01:03
  • if i fixed something (got a litle further) but haven't solved the problem completely yet, am i supposed to add an answer, edit the question or add a comment? – yarns Mar 22 '23 at 02:36
  • 1
    Depend on what you want: to share your code with future generations - add answer, to change setup of question (minor details) - edit question. If you stuck in narrow specific place with your code, I'd recommend to extract [mre] and create separate question. – markalex Mar 22 '23 at 06:07
  • I think we're close to an answer, i'll add what i've got after a little tinkering. i'll focus more on the minimal reproducible example too, thanks @markalex – yarns Mar 23 '23 at 17:02
  • If you had luck converting my code to php, please provide it as an answer here. I was lately asked to remove my answer, as it is "off-language" and of no use for future users. – markalex May 17 '23 at 22:44
  • Without otherthinking it, just use simple alternation without bother to consolidate overlapping substrings. [PHP Regular Expression For The Dynamically Generated String](https://stackoverflow.com/q/28738788/2943403) – mickmackusa May 17 '23 at 23:01
  • @markalex i will, this one is close but i'll clean it up https://3v4l.org/Pqhaq – yarns Jun 01 '23 at 20:15

1 Answers1

0

To create a full blown regex trie is a fairly complex action.
And it absolutely reduces the time to match.

I've done this with a software program.
Just put the string into a field, it parses then pumps out a trie.

Your samples are easily done by hand, but I wouldn't expect those algorithms are
conducive to put in programs.

When you construct a trie, always use the descending method if possible.

Example, DOMNode, DOMText, DOMElement

Descending (recommended) Example|DOM(?:(?:Tex|Elemen)t|Node)
Ascending (not recommended) DOM(?:(?:Elemen|Tex)t|Node)|Example

Example, ArrayIterator, RecursiveArrayIterator, DirectoryIterator, RecursiveDirectoryIterator

Descending (recommended) (?:Recursive(?:Director|Arra)|Director|Arra)yIterator|Example

Ascending (not recommended) (?:Arra|Director|Recursive(?:Arra|Director))yIterator|Example

sln
  • 2,071
  • 1
  • 3
  • 11
  • you mean to use descending method is better for the regex to match? @sln edit: ah i get it, so the shorter alternations are first and match quicker – yarns Mar 18 '23 at 01:09
  • thanks for the references. I was able to look up a few terms i didn't know like ascending and descending trie, even after years using regex :S. i'm diggin into it – yarns Mar 23 '23 at 16:59