Search groups of string where order of groups is irrelevant

Question

I have groups of strings and I need to find all of them with a regular expression, where the order of the groups is irrelevant.

I need to locate all the required ingredients in a user's answer. The user can enter the ingredients in any order, separated by any character or string (space, comma), or with no delimiter at all.

$string = "banana, strawberry, cherry and chocolate";
$regex = "/(banana)*(strawberry)*(cherry)*(chocolate)/";
if (preg_match($regex, $string)) {
 // do something
}

The problem with my code is that if the user's answer is "strawberry, banana, cherry", preg_match accepts it, which is wrong because chocolate is also required. It also accepts "strwberry" instead of "strawberry". The user's answer must include all 4 ingredients, in any order, with no typos in the ingredient names. Thank you very much for any hint.

_..or delimiter is not necessary..._ Huh? What about `bananastrawberrycherry`? Would this be valid? — B001ᛦ, Jun 24 '19 at 08:59
In my mind there is no regex needed: just do 4 strpos checks and check that all of them are true — Kapsonfire, Jun 24 '19 at 09:01
bananastrawberrycherry shouldn't be valid, but bananastrawberrycherrychocolate should be valid, please — Bambi Bunny, Jun 24 '19 at 09:06
Try this: `'/(banana).*(strawberry).*(cherry).*(chocolate)/'`. It will work even for 'bananastrawberrycherrychocolate' — sujeet, Jun 24 '19 at 09:09
@Kapsonfire I know, but I think a regex is more elegant and needs less code, doesn't it? :) — Bambi Bunny, Jun 24 '19 at 09:15
@SujeetAgrahari It doesn't work because for example "strawberry,cherry,chocolate,banana" doesn't work. The order of ingredients must be irrelevant — Bambi Bunny, Jun 24 '19 at 09:25
You can loop over it. This code will work, I think, in all cases. I have also made the search case-insensitive. ```$string = "banana, strawberry, cherry and chocolate"; $answers = ["banana","cherry","chocolate","strawberry"]; foreach ($answers as $answer) { if(preg_match("/($answer)/i",$string,$matches)) { var_dump($matches[1]); } }``` — sujeet, Jun 24 '19 at 09:49
OK, the best solution I found is `^(?=.*\bstrawberry\b)(?=.*\bcherry\b)(?=.*\bchocolate\b)(?=.*\bbanana\b).*$` but in this case the ingredients must be delimited by some character. I think it's easier to force users to delimit the values — Bambi Bunny, Jun 24 '19 at 10:58
Regex is slower and I don't think it's more elegant. You can even write a function like hasAllKeywords(array $keywords) — Kapsonfire, Jun 24 '19 at 11:36

Casimir et Hippolyte · Accepted Answer · 2019-06-26T23:11:08.990

About your request:

User can put ingredients in any order and he can delimited by any char or string (space, comma) or delimiter is not necessary.

The order of the ingredients isn't a problem; we will see that later. But doing without delimiters is a very bad idea! Consider the following example (a fruit salad):

$ingredients = ['melon', 'orange', 'grape', 'apple'];
$userAnswer = 'watermelonorangegrapeapple';

The problem is obvious: without delimiters there is no way to tell "melon" from "watermelon", so this constraint will cause false positives.

Don't forget that a user is responsible for what he writes and will learn from his own errors when he doesn't get the desired result. Another option is to make the user enter the ingredients one by one using input fields.
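The false positive can be reproduced with a plain substring check, which is roughly what "no delimiter required" amounts to. This is only a sketch; the helper name containsAllSubstrings() is made up for the illustration:

```php
<?php
// Hypothetical helper: accepts the answer when every ingredient
// occurs somewhere in it as a substring.
function containsAllSubstrings(array $ingredients, string $answer): bool
{
    foreach ($ingredients as $ingredient) {
        if (strpos($answer, $ingredient) === false) {
            return false;
        }
    }
    return true;
}

$ingredients = ['melon', 'orange', 'grape', 'apple'];

// The user wrote "watermelon", not "melon", yet every substring is found.
$falsePositive = containsAllSubstrings($ingredients, 'watermelonorangegrapeapple');
var_dump($falsePositive); // bool(true)
```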

User's answer must include all 4 ingredients in any order and he cannot have typos in the name of ingredients.

Why not, but this time you are too restrictive in my opinion: what if the user writes "strawberries" instead of "strawberry"? It isn't really a typo; I think it's acceptable.

Possibilities:

Let's assume that everything is for the best in the best of all possible worlds: words are delimited and there's no typo.

As suggested in the previously linked question, you can do:

if ( preg_match('~(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)(?=.*\bword4\b)~Ai', $userAnswer) ) {
    //...
}

But it isn't the compact, right-to-the-point solution of your dreams:

1. It doesn't take delimiters into account.
2. You have to build the pattern dynamically for each ingredient list (however, it isn't difficult).
3. Each lookahead has to go through the whole string.
4. It is neither flexible nor scalable.
5. If you have doubts about points 2 to 5, see point 1.
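Building the pattern dynamically, as mentioned in the list above, could look like the following sketch. The helper name buildAllWordsPattern() is an assumption for the illustration, and the sketch assumes that each ingredient starts and ends with a word character, so that `\b` behaves as intended:

```php
<?php
// Hypothetical helper: builds one lookahead per ingredient.
function buildAllWordsPattern(array $words): string
{
    $lookaheads = array_map(
        function ($word) {
            // preg_quote() escapes regex metacharacters and the "~" delimiter.
            return '(?=.*\b' . preg_quote($word, '~') . '\b)';
        },
        $words
    );

    // A anchors the match at the start of the string, i makes it case-insensitive.
    return '~' . implode('', $lookaheads) . '~Ai';
}

$pattern = buildAllWordsPattern(['banana', 'strawberry', 'cherry', 'chocolate']);

$allPresent = preg_match($pattern, 'strawberry, cherry, chocolate, banana') === 1;
$oneMissing = preg_match($pattern, 'strawberry, banana, cherry') === 1;
$typo       = preg_match($pattern, 'strwberry, banana, cherry, chocolate') === 1;
```

Only the first answer matches: the second lacks "chocolate" and the third misspells "strawberry".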

Another approach: split the user string on the delimiter and use array_diff to check that each ingredient is present.

Basic:

$delimiter = '~ \b \s* (?: , \s* | \s and \s+ ) ~uxi';

$parts = preg_split($delimiter, $userAnswer, -1, PREG_SPLIT_NO_EMPTY);

if ( empty(array_diff($ingredients, $parts)) ) {
    // all ingredients are here
}
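As a worked run of the basic version, the sketch below returns the missing ingredients instead of a boolean. The helper name missingIngredients() and the sample answers are assumptions for the illustration:

```php
<?php
$ingredients = ['banana', 'strawberry', 'cherry', 'chocolate'];
$delimiter = '~ \b \s* (?: , \s* | \s and \s+ ) ~uxi';

// Hypothetical helper: returns the ingredients that the answer does not contain.
function missingIngredients(array $ingredients, string $userAnswer, string $delimiter): array
{
    $parts = preg_split($delimiter, $userAnswer, -1, PREG_SPLIT_NO_EMPTY);

    // array_diff() preserves keys, so reindex to get a clean list.
    return array_values(array_diff($ingredients, $parts));
}

$complete   = missingIngredients($ingredients, 'strawberry, banana, cherry and chocolate', $delimiter);
$incomplete = missingIngredients($ingredients, 'strawberry, banana, cherry', $delimiter);
$typo       = missingIngredients($ingredients, 'strwberry, banana, cherry and chocolate', $delimiter);
```

Here $complete is empty, $incomplete holds 'chocolate' and $typo holds 'strawberry'. Note that array_diff() compares strings case-sensitively, so "Banana" would not count in this basic version.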

With a sanitization:

$delimiter = '~ \b (?: [ ]? , [ ]? | [ ] and [ ] ) ~ux';

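// Lowercase, then collapse whitespace and punctuation into single spaces.
// Commas are kept (\pP would otherwise match them) because $delimiter splits on them.
$userAnswer = trim(preg_replace('~(?:(?!,)[\s\pP])+~u', ' ', mb_strtolower($userAnswer)));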

$parts = preg_split($delimiter, $userAnswer);

if ( empty(array_diff($ingredients, $parts)) ) {
    // all ingredients are here
}

With a lenient comparison between strings:

$delimiter = '~ \b (?: [ ]? , [ ]? | [ ] and [ ] ) ~ux';

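// Same sanitization as before: commas are kept for the split, and the u flag handles UTF-8 input.
$userAnswer = trim(preg_replace('~(?:(?!,)[\s\pP])+~u', ' ', mb_strtolower($userAnswer)));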

$parts = preg_split($delimiter, $userAnswer);

if ( empty(array_udiff($ingredients, $parts, $callback)) ) {
    // all ingredients are here
}

Callback function example:

Callback functions for array_udiff are ordinary comparison functions, like those used to sort an array; in other words, sorting is a necessary step under the hood to compare two arrays. That's why comparing two items must return a negative integer, a positive integer, or 0 to determine their order.

PHP has two functions to perform a fuzzy comparison between strings: similar_text() and levenshtein().

Here is an example using the Levenshtein distance. A distance of less than 2 means that at most one character has to be replaced, inserted or deleted to make the two strings equal (see the PHP manual for more control).

$callback = function ($a, $b) {
    // Strings within one edit of each other count as equal; otherwise order them normally.
    return levenshtein($a, $b) < 2 ? 0
                                   : ( $a < $b ? -1 : 1 );
};
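One caveat: "within one edit" is not a transitive relation, so such a callback may not give the sort inside array_udiff a fully consistent ordering. A possible alternative, shown here only as a sketch with the made-up helper hasAllIngredientsFuzzy(), is to compare every ingredient against every part explicitly:

```php
<?php
// Hypothetical helper: true when every ingredient has at least one part
// within $maxDistance edits (Levenshtein distance).
function hasAllIngredientsFuzzy(array $ingredients, array $parts, int $maxDistance = 1): bool
{
    foreach ($ingredients as $ingredient) {
        $found = false;
        foreach ($parts as $part) {
            if (levenshtein($ingredient, $part) <= $maxDistance) {
                $found = true;
                break;
            }
        }
        if (!$found) {
            return false;
        }
    }
    return true;
}

$ingredients = ['banana', 'strawberry', 'cherry', 'chocolate'];

// One missing letter is tolerated with the default distance of 1.
$oneTypo = hasAllIngredientsFuzzy($ingredients, ['strwberry', 'banana', 'cherry', 'chocolate']);

// "strawberries" is 3 edits away from "strawberry", so it needs a larger tolerance.
$plural        = hasAllIngredientsFuzzy($ingredients, ['strawberries', 'banana', 'cherry', 'chocolate']);
$pluralLenient = hasAllIngredientsFuzzy($ingredients, ['strawberries', 'banana', 'cherry', 'chocolate'], 3);
```

The nested loop costs one levenshtein() call per ingredient/part pair, which should be acceptable for short lists. A tolerance as large as 3 will also accept unrelated short words, so it is a trade-off rather than a recommendation.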

Note that these two functions may have a non-negligible cost for long strings, since similar_text() is O(max(m,n)^3) and levenshtein() is O(m*n) (m and n are the lengths of the strings). If that becomes a problem, you can also use functions like metaphone() or soundex() to transform the strings before comparing them, or write a transformation of your own. This means modifying the data structure that contains the ingredients in advance, to make the comparison easier.
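A minimal sketch of the transformation idea, using soundex() (metaphone() could be swapped in the same way). The helper name phoneticKeys() and the sample input are assumptions for the illustration:

```php
<?php
// Hypothetical helper: maps each word to its phonetic key.
function phoneticKeys(array $words): array
{
    return array_map('soundex', $words);
}

$ingredients = ['banana', 'strawberry', 'cherry', 'chocolate'];

// Computed once, ahead of time, and stored alongside the ingredient list.
$ingredientKeys = phoneticKeys($ingredients);

// Parts as they might come out of preg_split(), with two misspellings.
$parts = ['strwberry', 'bananna', 'chocolate', 'cherry'];

// Plain array_diff() is enough once both sides are reduced to keys.
$allPresent = empty(array_diff($ingredientKeys, phoneticKeys($parts)));

// Without "chocolate", its key is missing from the user's side.
$oneMissing = empty(array_diff($ingredientKeys, phoneticKeys(['strwberry', 'bananna', 'cherry'])));
```

Soundex keys are coarse (a letter plus three digits), so unrelated words can share a key; whether that is acceptable depends on the ingredient list.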
