This is a fun little problem where you need to assign some sort of score to each element of your haystack (the array of words). I think the best way to score each element is based on a classic dynamic programming problem "longest common substring". The longer this substring, the higher the sort score.
//find the longest commson substring between 2 strings, return all substrings of that length
function longestCommonSubstring($string1, $string2) {
$helper = array();
//create two dimensional array, to keep track
for($i =0; $i < strlen($string1); $i++) {
$helper[$i] = array();
for($j=0; $j< strlen($string2); $j++) {
//intialize all values to 0
$helper[$i][] = 0;
}
}
$max= 0;
$ans = array();
for($i =0; $i <strlen($string1); $i++) {
for($j =0; $j < strlen($string2); $j++) {
if ($string1[$i] == $string2[$j]) {
if($i==0 || $j==0) {
$helper[$i][$j] = 1;
} else {
$helper[$i][$j] = $helper[$i-1][$j-1] + 1;
}
if ($helper[$i][$j] > $max) {
$max = $helper[$i][$j];
$ans = array(substr($string1, $i-$max+1, $max));
} elseif($helper[$i][$j] == $max) {
$ans[] = substr($string1, $i-$max+1, $max);
}
} else {
$helper[$i][$j] = 0;
}
}
}
return $ans;
}
Now that the function is written we need to use it.
foreach($words as $word) {
$lcs = longestCommonSubstring($keyword, $word);
}
Well, that is all and good, but just using the function is only half the battle, now we need to apply some logic to results. Let's save the results in an array and give each word a score. A good score would be the length of longest substring. football
would be a better match than ball
because it has a longer string in common. But what about football
and football player
, they both have the same length of longest common substring? To solve this problem, we could use length as a percentage of the total word length. Combining these two idea, longest substring length and percentage, we get the code below.
//an associative array to save the scores
// $wordsMeta[$word] = array(lengthOfCommonSubstring, percentageOfWordMatched)
$wordsMeta = array();
//go through each word and assign a score
foreach($words as $word) {
$lcs = longestCommonSubstring($keyword, $word);
if (count($lcs) ==0 ) {
$wordPercentage = 0;
$wordLength = 0;
} else {
$wordLength = strlen($lcs[0]);
$wordPercentage = $wordLength/strlen($word);
}
$wordsMeta[$word] = array(
"percentageOfWordMatched" => $wordPercentage,
"lengthOfCommonSubstring" => $wordLength
);
}
Now we just need a sorting function that looks at length first, if they are equal, it will look at percentage and return an appropriate integer.
//our special sorting function
//checks length, if that is equal, then it checks percentage of word matched
//if both are eqaul, then those two elements are considered equal
$sort = function($a, $b) {
$ans = $a["lengthOfCommonSubstring"] - $b["lengthOfCommonSubstring"];
if ($ans == 0) {
$ans = $a["percentageOfWordMatched"] - $b["percentageOfWordMatched"];
}
if ($ans < 0) {
$ans = -1;
} elseif ($ans > 0){
$ans = 1;
} else {
$ans = 0;
}
//higher number = lower sort order
$ans *= -1;
return $ans;
};
Now the easy part: uasort($wordsMeta)
and $answer= array_keys($wordsMeta)
Beware of Demons - This algorithm is slow. Very slow. lcs
is O(n*m)
, and we are calling it count($words)
times. Making the scoring process O(n*m*x)
where:
n
is strlen($keyword)
m
is strlen($word)
x
is count($words)
Additionally we are sorting, which is O(n * log(n))
. So in total this algorithm is O(n*m*x + n*log(n))
which is not good. Keeping the wordlist short, the words in the word list short, and the keyword short will keep speed down.