I have a list of strings and regexes, and I want to check which of them match an input string.
Let's say I have this list:

$list = [ // an array list of string/regex that i want to check
  "lorem ipsum", // a words
  "example", // another word
  "/(nulla)/", // a regex
];

And the string:

$input_string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer quam ex, vestibulum sed laoreet auctor, iaculis eget velit. Donec mattis, nulla ac suscipit maximus, leo  metus vestibulum eros, nec finibus nisl dui ut est. Nam tristique varius mauris, a faucibus augue.";

I want the check to work something like this:

if( $matched_string >= 1 ){ // check whether at least one string matched
 // do something...
 // output matched string: "lorem ipsum", "nulla"
}else{
 // nothing matched
}

How can I do something like that?

bobble bubble
Tunku Salim

3 Answers

I'm not sure whether this approach will work for your case, but you could treat them all as regexes.

$list = [ // an array list of string/regex that i want to check
  "lorem ipsum", // a words
  "Donec mattis",
  "example", // another word
  "/(nulla)/", // a regex
  "/lorem/i"
];
$input_string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer quam ex, vestibulum sed laoreet auctor, iaculis eget velit. Donec mattis, nulla ac suscipit maximus, leo  metus vestibulum eros, nec finibus nisl dui ut est. Nam tristique varius mauris, a faucibus augue.";

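// a string counts as a regex if it is wrapped in slashes, optionally followed by PHP pattern modifiers
$is_regex = '/^\/.*\/[imsxuADSUXJ]*$/';
$list_matches = [];
foreach($list as $str){
    // use the string as-is if it is already a regex; otherwise escape it and wrap it in delimiters
    $patt = (preg_match($is_regex, $str))? $str: '/' . preg_quote($str, '/') . '/';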
    $item_matches = [];
    preg_match($patt, $input_string, $item_matches);
    if(!empty($item_matches)){
        // only add to the list if matches
        $list_matches[$str] = $item_matches;
    }
}
if(empty($list_matches)){
    echo 'No matches from the list found';
}else{
    var_export($list_matches);
}

The above will output the following:

array (
  'Donec mattis' => 
  array (
    0 => 'Donec mattis',
  ),
  '/(nulla)/' => 
  array (
    0 => 'nulla',
    1 => 'nulla',
  ),
  '/lorem/i' => 
  array (
    0 => 'Lorem',
  ),
)

Sandbox

Arleigh Hix

Try the following:

<?php
$input_string = "assasins: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer quam ex, vestibulum sed laoreet auctor, iaculis eget velit. Donec mattis, nulla ac suscipit maximus, leo  metus vestibulum eros, nec finibus nisl dui ut est. Nam tristique varius mauris, a faucibus augue.";

$list = [ // an array list of string/regex that i want to check
"ass", // should match the ass in assasins
"Lorem ipsum", // a words
"consectetur", // another word
"/(nu[a-z]{2}a)/", // a regex
];
$regex_list = [];
foreach($list as $line) {
    if ($line[0] == '/' and $line[-1] == '/')
        $regex = '(?:' . substr($line, 1, -1) . ')';
    else
        $regex = '\\b' . preg_quote($line, '/') . '\\b';
    $regex_list[] = $regex;
}
$regex = '/' . implode('|', $regex_list) . '/';
echo "$regex\n";
preg_match_all($regex, $input_string, $matches, PREG_SET_ORDER);
print_r($matches);

$s = [];
foreach ($matches as &$match) {
    $s[] = $match[0];
}
$s = json_encode($s);
echo "Matched strings: ", substr($s, 1, -1), "\n";

Prints:

/\bass\b|\bLorem ipsum\b|\bconsectetur\b|(?:(nu[a-z]{2}a))/
Array
(
    [0] => Array
        (
            [0] => Lorem ipsum
        )

    [1] => Array
        (
            [0] => consectetur
        )

    [2] => Array
        (
            [0] => nulla
            [1] => nulla
        )

)
Matched strings: "Lorem ipsum","consectetur","nulla"

Discussion and Limitations

In processing each element of $list, if the string begins and ends with '/', it is assumed to be a regular expression; the '/' characters are removed from the start and end of the string, and what remains is wrapped in a non-capturing group, (?:...). Anything that does not begin and end with '/' is treated as a plain string. This implies that if the OP wanted to match a plain string that just happens to begin and end with '/', e.g. '/./', they would have to write it as a regular expression instead: '/\/.\//'. A plain string is passed through preg_quote to escape characters that have special meaning in regular expressions and is then surrounded with '\b' so that it only matches on word boundaries, which converts it into a regex without the opening and closing '/' delimiters. Finally, all the pieces are joined with the regular expression alternation ("or") character, '|', and the result is wrapped in '/' delimiters to create a single regular expression from the input.
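As a minimal sketch of why the escaping step matters (the entry 'a.b' and the subject 'axb' are made-up values for illustration, not part of the OP's data), compare an unescaped plain string with the same string passed through preg_quote:

```php
<?php
// Hypothetical plain-text entry that contains a regex metacharacter.
$plain = 'a.b';
$subject = 'axb';

// Unescaped, the dot matches any character, so "axb" matches.
$unescaped = preg_match('/' . $plain . '/', $subject);

// preg_quote() escapes the dot, so only a literal "a.b" would match.
$escaped = preg_match('/' . preg_quote($plain, '/') . '/', $subject);

// Passing '/' as the second argument also escapes the delimiter itself.
$quoted = preg_quote('/./', '/');

var_dump($unescaped); // int(1)
var_dump($escaped);   // int(0)
echo $quoted, "\n";   // \/\.\/
```

Without the second argument, preg_quote would leave the '/' characters unescaped and the combined pattern would end early at the first one.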

The main limitation is that backreference numbers are not adjusted automatically if multiple regular expressions in the input list have capture groups, since the group numbering is affected when the regular expressions are combined. A regex pattern that uses backreferences must therefore take into account any capture groups in the patterns that precede it and adjust its backreference numbers accordingly (see demo below).

Regex flags (i.e. pattern modifiers) must be embedded within the regex itself. Since such flags in one regex string of $list will affect the processing of the regex strings that follow it, flags that are used in one regex but do not apply to a subsequent regex must be explicitly turned off:

<?php
$input_string = "This is an example by Booboo.";

$list = [ // an array list of string/regex that i want to check
"/(?i)booboo/", // case insensitive
"/(?-i)EXAMPLE/" // explicitly case sensitive (turns the i flag back off)
];
$regex_list = [];
foreach($list as $line) {
    if ($line[0] == '/' and $line[-1] == '/')
        $regex_list[] = substr($line, 1, -1);
    else
        $regex_list[] = preg_quote($line, '/');
}
$regex = '/' . implode('|', $regex_list) . '/';
echo $regex, "\n";
preg_match_all($regex, $input_string, $matches, PREG_SET_ORDER);
print_r($matches);

$s = [];
foreach ($matches as &$match) {
    $s[] = $match[0];
}
$s = json_encode($s);
echo "Matched strings: ", substr($s, 1, -1), "\n";

Prints:

/(?i)booboo|(?-i)EXAMPLE/
Array
(
    [0] => Array
        (
            [0] => Booboo
        )

)
Matched strings: "Booboo"

This shows how to correctly handle backreferences by manually adjusting the group numbers:

<?php
$input_string = "This is the 22nd example by Booboo.";

$list = [ // an array list of string/regex that i want to check
"/([0-9])\\1/", // two consecutive identical digits
"/(?i)([a-z])\\2/" // two consecutive identical alphas
];
$regex_list = [];
foreach($list as $line) {
    if ($line[0] == '/' and $line[-1] == '/')
        $regex_list[] = substr($line, 1, -1);
    else
        $regex_list[] = preg_quote($line, '/');
}
$regex = '/' . implode('|', $regex_list) . '/';
echo $regex, "\n";
preg_match_all($regex, $input_string, $matches, PREG_SET_ORDER);
print_r($matches);

$s = [];
foreach ($matches as &$match) {
    $s[] = $match[0];
}
$s = json_encode($s);
echo "Matched strings: ", substr($s, 1, -1), "\n";

Prints:

/([0-9])\1|(?i)([a-z])\2/
Array
(
    [0] => Array
        (
            [0] => 22
            [1] => 2
        )

    [1] => Array
        (
            [0] => oo
            [1] =>
            [2] => o
        )

    [2] => Array
        (
            [0] => oo
            [1] =>
            [2] => o
        )

)
Matched strings: "22","oo","oo"
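
One possible way to sidestep the manual renumbering, offered only as a sketch and not as part of the approach above, is PCRE's relative backreference syntax, \g{-1}, which refers to the most recently opened capture group rather than to a fixed number. It assumes every list entry can be rewritten in that form:

```php
<?php
$input_string = "This is the 22nd example by Booboo.";

// Each pattern refers to "the group opened just before this point",
// so it does not matter how many groups earlier patterns contain.
$list = [
    '/([0-9])\g{-1}/',     // two consecutive identical digits
    '/(?i)([a-z])\g{-1}/', // two consecutive identical letters
];

$regex_list = [];
foreach ($list as $line) {
    // Strip the delimiters and wrap each pattern in a non-capturing group.
    $regex_list[] = '(?:' . substr($line, 1, -1) . ')';
}
$regex = '/' . implode('|', $regex_list) . '/';

preg_match_all($regex, $input_string, $matches);
echo implode(', ', $matches[0]), "\n"; // 22, oo, oo
```

This only helps with backreferences inside each pattern; the capture-group indexes reported in $matches still shift when patterns are combined.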
Booboo
  • I do not recommend this answer because it makes the mistake of implementing `preg_quote()` without declaring a slash as the second function parameter. – mickmackusa Nov 28 '22 at 00:57
  • @mickmackusa You make a good point and I have updated my answer accordingly. – Booboo Nov 28 '22 at 03:09
  • This answer may not be reliable if pattern delimiters other than a forward slash are used. This answer may not be reliable if pattern modifiers are added after the ending pattern delimiter. – mickmackusa Nov 28 '22 at 03:27
  • @mickmackusa See revised Limitations section on how regex pattern modifiers are to be handled. – Booboo Nov 28 '22 at 12:16
  • It is not necessary to declare `$match` as "modifiable by reference" inside of `foreach()`, you are not modifying it. To comply with PSR-12 guidelines, curly braces should be used with `if` and `else`. I avoid using `and` in PHP to prevent unintended "precedence" bugs -- not that I suspect a problem here. – mickmackusa Nov 28 '22 at 19:58
  • @Booboo Well, it also matches words that weren't supposed to match. For example, if the list contains "ass", it also matches "assasins" or any other word that contains "ass". Is there any way to avoid that? – Tunku Salim Jan 29 '23 at 18:18
  • This is what I did: (1) I added 'assasins' to the beginning of `$input_string`. (2) I improved the code to surround every regex in `$list` with `(?:` and ')' (turning it into a non-capturing group) just in case the regex contains the '|' character. (3) To interpret the non-regex entries in $list as *words* rather than just *strings* I have surrounded the string with '\b' which will only match the string if it is on a *word boundary*. That is, the characters to the left and right must not be other letters. It will still match the 'ass' in '+ass+'. – Booboo Jan 29 '23 at 22:23
  • @Tunku Since you are only engaging with Booboo, does this mean that you are disinterested in my answer? Should I not bother extending my answer to include your late requirements? – mickmackusa Jan 30 '23 at 00:50
  • @mickmackusa Let's say you had `$list = ["/\\bd\\w*r\\b/", "/\\b\\w{5}\\b/"];` to match a word of any size beginning with 'd' and ending in 'r' or any 5-letter word and you had `$input_string = "dolor or more";`. You would then be matching the *same* 'dolor' occurrence twice. But my understanding is that the OP wants to know how many substrings in the input can be matched and not how many regexes in `$list` can match something in the input. (more...) – Booboo Jan 30 '23 at 12:34
  • @mickmackusa Moreover, if the input string actually had 'dolor' twice (obviously in different positions), e.g. 'dolor dolor', and we had `$list = ["/\\bd\\w*r\\b/"];`, then my solution would match both occurrences but your solution would imply there was only a single match. So if the OP wanted to know, for example, how many times a number appears in the input, you couldn't possibly answer that *currently*. – Booboo Jan 30 '23 at 12:35
  • @mickmackusa Now I will admit that the title of the question suggests your approach could be the correct one. The OP needs to chime in with a comment or a change of which answer is accepted. – Booboo Jan 30 '23 at 12:49

Typically, I scream bloody murder if someone dares to stink up their code with error suppressors. If your input data is so out-of-your-control that you are allowing a mix of regex and non-regex input strings, then I guess you'll probably condone @ in your code as well.

Validate whether the search string is a regex or not, as demonstrated here. If it is not a valid regex, call preg_quote() on it and wrap it in delimiters to form a valid regex pattern before applying it to the actual haystack string.

Code: (Demo)

$list = [ // an array list of string/regex that i want to check
  "lorem ipsum", // a words
  "example", // another word
  "/(nulla)/", // a valid regex
  "/[,.]/", // a valid regex
  "^dolor^", // a valid regex
  "/path/to/dir/", // not a valid regex
  "[integer]i", // valid regex not implementing a character class
];

$input_string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer quam ex, vestibulum sed laoreet auctor, iaculis eget velit. Donec mattis, /path/to/dir/ nulla ac suscipit maximus, leo  metus vestibulum eros, nec finibus nisl dui ut est. Nam tristique varius mauris, a faucibus augue.";

$result = [];
foreach($list as $v) {
    if (@preg_match($v, '') === false) {
        // not a regex, make into one
        $v = '/' . preg_quote($v, '/') . '/';
    }
    preg_match($v, $input_string, $m);
    $result[$v] = $m[0] ?? null;
}
var_export($result);

Or you could write the same thing this way, although I don't know whether checking the pattern against a non-empty string costs anything in performance: (Demo)

$result = [];
foreach($list as $v) {
    if (@preg_match($v, $input_string, $m) === false) {
        preg_match('/' . preg_quote($v, '/') . '/', $input_string, $m);
    }
    $result[$v] = $m[0] ?? null;
}
var_export($result);
mickmackusa
  • The OP wanted all matched strings so what if a given regex matched multiple occurrences in the input? So I think you want to be using `preg_match_all`. – Booboo Nov 28 '22 at 11:04
  • There is a lack of specificity in the problem definition, so it's not unreasonable to assume that the OP consistently uses '/' as the regex delimiters and therefore anything else that does not begin and end with these characters must be a plain string. This implies that if the OP wanted to match a plain string that just happens to begin and end with '/', e.g. '/./', they would have to do it instead as a regular expression: '/\\/.\\//'. Furthermore, this implies that you will erroneously consider '|.|' to be a regex because of the way you are testing for a regex. – Booboo Nov 28 '22 at 11:57
  • I would not consider `|.|` to be erroneously considered regex -- it is valid regex and can logically be treated as such within the scope of this question. For an input that may or may not be a regex pattern, it would be a flaw in the application if it did not respect a valid pattern. If the input does not give the result that the user/developer wanted, then the onus is on them to craft a better search string. – mickmackusa Nov 28 '22 at 20:02