PHP array sorting on relevance

Question

I have a PHP array of strings, which holds values like "Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook", and so on.

I need to go through each array item and show the matching results in order of RELEVANCE. For example, if the user searches for the term "Blue", then the 3rd item, "Blue", which is a perfect match, should be listed at the top, followed by the 2nd item, "Blue Pen", then the 1st, "Blue Pencil", and finally "Blue Notebook". All non-Blue items should be discarded.

I tried using the sort and rsort functions on the array (both before and after pulling out the matching Blue items), but they simply sort alphabetically or reverse-alphabetically. There is no notion of relevance. For example, sort($array) returns the following:

Blue
Blue Notebook
Blue Pen
Blue Pencil

which is NOT the expected "relevant" order.

Also, the levenshtein function does NOT fit, as it is restricted to strings of at most 255 characters. My strings can be longer.

To draw a parallel, MySQL has the MATCH ... AGAINST clause, which does this kind of work:

SELECT * , MATCH (col1, col2) AGAINST ('some words' IN NATURAL LANGUAGE MODE)

I am looking for something similar in PHP. Can anyone provide pointers, or a user-defined function that could be written for this?

This has nothing to do with MySQL. Everything is in PHP. I mentioned MySQL example as I am looking for similar function in PHP — Aquaholic, May 17 '22 at 12:08
@0stone0 - thanks, but that levenshtein function has a restriction that it works on strings with maximum length 255. My strings can be longer. — Aquaholic, May 17 '22 at 12:36
@0stone0 - This is really strange - Question is closed by some admin without even looking at the whole requirement. Someone simply said "levenshtein" function, and q was closed. That levenshtein function has restrictions which don't match my needs. Please re-open this question. SO admins often get too restrictive :-( — Aquaholic, May 17 '22 at 12:40
There was no admin involved. And nowhere you mentioned such requirements. — gre_gor, May 17 '22 at 12:44
@Aquaholic please [edit] your question with more details. Eg, why the marked duplicate does not answer your question. — 0stone0, May 17 '22 at 13:05
@0stone0 - I clearly mentioned it in the above comments (responding to yours) that levenshtein function doesnt fit. Anyways, have edited the q once again. Somehow, SO ppl are becoming unnecessarily too strict. Q has already got 3 answers and 2 upvotes, but still the few mark it for anything. Bad for unique questions, and for SO community — Aquaholic, May 17 '22 at 13:57
The q contains the clear desc of the problem, what was tried and NOT working, why existing functions (including suggested) dont fit the bill, and what is the expected output. And yet, it is marked duplicate (of another marked duplicate). — Aquaholic, May 17 '22 at 13:59
@Aquaholic, you added your 255 character `levenshtein()` detail _after_ the question was closed. The comment speaking to the same point is irrelevant. All relevant information must be in the _question_, as comments can be deleted at any time for any reason. [They are "temporary post-it notes"](https://stackoverflow.com/help/privileges/comment). — ChrisGPT was on strike, May 17 '22 at 16:04
Thanks @Chris. I hope u understand the issue. Members with privileges are in such a hurry to mark unique & genuine q's to close/duplicate, that they can't even allow some time (like say, 24-hours) for q's to be updated/deleted by the asker after comments are exchanged. For everyone's benefit, may be SO should impose this minimum 24-hour timeline before a q can be closed/marked dupe. It gets really irritating - one struggles with code, looks for help, puts time/effort in drafting the q with absolute clarity, & due to extreme hurry of certain ones the genuine q gets closed/marked dupe :(( — Aquaholic, May 17 '22 at 16:20
@Aquaholic, I hear your frustration and I don't have a good solution. In fairness, this question _was_ a duplicate when it was marked as such. It has since been edited and is in the reopen queue so other users can evaluate it for reopening. There isn't a 24 hour period as you suggest, but the "closed" state is sort of an intermediate state before deletion. I know that's not what you're looking for, but it's the system we have. A 24 hour protected window would mean really bad questions (obvious dupes, unclear questions, even spam) would remain on the site for a full day. — ChrisGPT was on strike, May 17 '22 at 16:31
Maybe this question would be a better dupe candidate? I don't have a gold badge for any of the tags on this question so can't change the dupe target myself. [String similarity in PHP: levenshtein like function for long strings](https://stackoverflow.com/q/5092708/354577) — ChrisGPT was on strike, May 17 '22 at 16:33

score 1 · Answer 1 · answered May 17 '22 at 12:37

Fulltext search is a complex subject.

A combination of array_filter and usort with levenshtein will result in the answer you want for this particular query, but you will find that it quickly falls apart for other queries:


$data = explode(', ', 'Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook');
$query = 'Blue';

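// Keep only the items that contain $query (case-sensitive substring match;
// str_contains() needs PHP 8.0+, arrow functions need PHP 7.4+)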
$data = array_filter($data, fn ($s) => str_contains($s, $query));

// Sort by the Levenshtein distance from the $query
usort($data, fn($a, $b) => levenshtein($query, $a) - levenshtein($query, $b));

var_dump($data);

// Will print: 
// array(4) {
//   [0]=>
//   string(4) "Blue"
//   [1]=>
//   string(8) "Blue Pen"
//   [2]=>
//   string(11) "Blue Pencil"
//   [3]=>
//   string(13) "Blue Notebook"
// }
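
If the 255-character limit is the concern: as far as I know, that restriction applies to PHP versions before 8.0, where levenshtein() returns -1 for longer strings. The following is a minimal sketch of an alternative that avoids levenshtein() altogether. It assumes that, once every candidate is known to contain the query, ranking by where the match starts and then by string length is a good enough stand-in for relevance. The function name rankByPositionAndLength is hypothetical, and the code uses classic anonymous functions and stripos() so that it also runs on PHP versions without arrow functions or str_contains().

```php
<?php
// Hypothetical helper (not part of the answer above): ranks the items that
// contain $query without calling levenshtein(), so string length is not limited.
function rankByPositionAndLength(array $items, $query)
{
    // Keep only the items that contain the query, ignoring case.
    $matches = array_filter($items, function ($item) use ($query) {
        return stripos($item, $query) !== false;
    });

    // Earlier match position first; on a tie, the shorter string first.
    // usort() also reindexes the result from 0.
    usort($matches, function ($a, $b) use ($query) {
        $posA = stripos($a, $query);
        $posB = stripos($b, $query);
        if ($posA !== $posB) {
            return $posA - $posB;
        }
        return strlen($a) - strlen($b);
    });

    return $matches;
}

$data = explode(', ', 'Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook');
print_r(rankByPositionAndLength($data, 'blue'));
// Order for this data: Blue, Blue Pen, Blue Pencil, Blue Notebook
```

This is still a substring heuristic, so it shares the limitations of the levenshtein approach for anything beyond simple queries.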

Think about:

- What happens if a user uses different capitalization? (The exact match won't work.)
- What if a user is looking for "a blue notebook"? (You'd need some kind of string tokenization.)
- Do you want to remove/ignore certain words, such as "the", "a", etc.?
- What happens if you have thousands of words to look through? This solution won't be very performant.
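
To make the first three points concrete, here is a rough sketch of token-based scoring. Everything in it is an assumption made for illustration: the function names (tokenize, relevanceScore, searchByRelevance), the stop-word list, the ASCII-only tokenizer, and the scoring formula (fraction of query words found in the item, multiplied by the fraction of item words covered by the query) are arbitrary choices, not an established algorithm. It requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper: lower-case the text, split it into ASCII words,
// and drop stop words. Non-ASCII letters are treated as separators here.
function tokenize(string $text, array $stopWords): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($tokens, $stopWords));
}

// Hypothetical score in the range 0..1: (share of query words present in the
// item) * (share of item words present in the query). Whole words only.
function relevanceScore(string $item, string $query, array $stopWords): float
{
    $itemTokens  = array_unique(tokenize($item, $stopWords));
    $queryTokens = array_unique(tokenize($query, $stopWords));
    if ($itemTokens === [] || $queryTokens === []) {
        return 0.0;
    }
    $shared = count(array_intersect($queryTokens, $itemTokens));
    return ($shared / count($queryTokens)) * ($shared / count($itemTokens));
}

function searchByRelevance(array $items, string $query, array $stopWords = ['a', 'an', 'the']): array
{
    $scored = [];
    foreach ($items as $item) {
        $score = relevanceScore($item, $query, $stopWords);
        if ($score > 0) {
            $scored[] = ['item' => $item, 'score' => $score];
        }
    }
    // Highest score first; on equal scores, the shorter item first.
    usort($scored, fn ($x, $y) => [$y['score'], strlen($x['item'])] <=> [$x['score'], strlen($y['item'])]);
    return array_column($scored, 'item');
}

$data = explode(', ', 'Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook');
print_r(searchByRelevance($data, 'BLUE'));
// Order: Blue, Blue Pen, Blue Pencil, Blue Notebook
print_r(searchByRelevance($data, 'a blue notebook'));
// Order: Blue Notebook, Blue, Blue Pen, Blue Pencil
```

Because this matches whole words, a query for "Pen" would not match "Pencil"; whether that is desirable depends on the use case. It also scans every item on every query, so the performance concern in the last point still applies.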

You may eventually find that you end up reaching for a true search engine, such as Apache Lucene or Elasticsearch.

Thanks @PietervandenHam - allow me some time to try the solution, and I'll revert. (will delete this comment then) — Aquaholic, May 17 '22 at 14:04
Hi Pieter, in excat match line, you are using single argument - fn ($s). While in the usort line, you are using two arguments - fn($a, $b). Sorry for my limited knowledge of such functions, will you be able to clarify this please, as it is throwing error. — Aquaholic, May 17 '22 at 15:40

score 0 · Answer 2 · answered May 17 '22 at 12:47

$input = [ "Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook" ];

$result = preg_grep("/^blue/i", $input);
print_r($result);
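
The snippet above filters the items but does not order them, and the pattern is hard-coded. One possible follow-up is sketched below; it is an assumption on my part, not necessarily the sorting the answerer had in mind. It builds the pattern from a search term with preg_quote() and then sorts the matches shortest first, on the assumption that a shorter match is closer to the query. The variable $query is introduced here for illustration, and the arrow function requires PHP 7.4 or later.

```php
<?php
$input = ["Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook"];
$query = 'blue'; // assumed to come from the user

// preg_quote() escapes regex metacharacters in the query; '/' is the delimiter in use.
$pattern = '/^' . preg_quote($query, '/') . '/i';
$result = preg_grep($pattern, $input);

// preg_grep() keeps the original keys; usort() sorts by length and reindexes from 0.
usort($result, fn ($a, $b) => strlen($a) <=> strlen($b));

print_r($result);
// Order for this data: Blue, Blue Pen, Blue Pencil, Blue Notebook
```

Note that the ^ anchor only matches items that start with the query, so "Light Blue Pen" would be skipped.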

Tejas Gurav

Thanks @TejasGurav - this works after the additional sorting you mentioned in your comment (+1). Additionally, Pieter has rightly mentioned some more thoughtful inputs in his answer, so allow me some time to try the broader "relevance matching", before I pick the right answer from all you good people. Appreciate your inputs. – Aquaholic May 17 '22 at 15:27

score -2 · Answer 3 · answered May 17 '22 at 12:24

You need the asort function:

$data = array("Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook");
asort($data);
print_r($data);

Output

Array
(
    [2] => Blue
    [6] => Blue Notebook
    [1] => Blue Pen
    [0] => Blue Pencil
    [4] => Red Ink
    [5] => Red Pen
    [3] => Red Pencil
)

RIZI

Thanks @RIZI - but your answer is not the solution. Please re-read the question. Thanks again anyways. – Aquaholic May 17 '22 at 14:09
