Find similar words in an array and eliminate them

Question

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

foreach($a as $name) {

echo $name;
echo '<br>';

}

Output:

paris
london
paris
london tour
london tours
london
londonn

I can eliminate exact duplicates with array_unique:

foreach(array_unique($a) as $name) {

echo $name;
echo '<br>';

}

Output:

paris
london
london tour
london tours
londonn

I want to take this further and eliminate similar words as well. For example, if there is a "london", I want to eliminate "londonn".

So the output will be:

paris
london
london tour

I tried similar_text($name, $name, $percent) but it did not help.

Here is what I tried with my limited knowledge:

foreach(array_unique($a) as $name) {

$test = $a;
foreach($test as $test1) {

 similar_text($name, $test1, $percent);
if ($percent > 90) {
echo $name;
echo '<br>';
} 

}
}

Output:

paris
paris
london
london
london
london tour
london tour
london tours
london tours
londonn
londonn
londonn

The words come from a list of popular searches:

$a[] = "$popular_search";

According to `similar_text()` "london" and "londonn" are 92.3% similar. So you could, for instance, see everything above 90% as the same. Please show what you tried and why that didn't help. — KIKO Software, Sep 20 '22 at 14:12
Using the [Levenshtein distance](https://www.php.net/manual/en/function.levenshtein.php) might be more useful than similar_text (it's at least a little easier to use). But either way - please tell us exactly what you've already tried with it, because it's definitely an option for situations like this. — iainn, Sep 20 '22 at 14:13
I edited the question and added what I tried with similar_text. — wp-ap, Sep 20 '22 at 14:16
Are you really looking for something as complex as Levenshtein distances, or are you simply looking if [one string is included in the other](https://stackoverflow.com/questions/4366730/how-do-i-check-if-a-string-contains-a-specific-word)? — Don't Panic, Sep 20 '22 at 14:18
I have a list of popular searches, like `$a[] = "$popular_search";`. I do not know how to use similar_text with it. — wp-ap, Sep 20 '22 at 14:22
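
To make the numbers mentioned in the comments concrete, here is a minimal sketch that prints both measures for a few pairs from the question. `similar_text()` and `levenshtein()` are PHP built-ins; the `compareMeasures()` helper and the `$pairs` list are hypothetical names used only for illustration.

```php
<?php
// Hypothetical helper: returns the similar_text() percentage and the
// Levenshtein distance for two strings.
function compareMeasures(string $s1, string $s2): array
{
    similar_text($s1, $s2, $percent); // $percent is filled in by reference
    return [
        'percent'  => round($percent, 1),
        'distance' => levenshtein($s1, $s2),
    ];
}

// Pairs taken from the question's sample data
$pairs = [
    ['london', 'londonn'],
    ['london tour', 'london tours'],
    ['london', 'london tour'],
];

foreach ($pairs as [$s1, $s2]) {
    $m = compareMeasures($s1, $s2);
    echo "$s1 | $s2: {$m['percent']}% similar, distance {$m['distance']}\n";
}
```

The two near-duplicate pairs come out above 90% with a Levenshtein distance of 1, while "london" and "london tour" land around 70% with a distance of 5. A percentage scales with string length, whereas a fixed edit distance does not, so a distance threshold that suits long phrases may be too loose for short words.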

score 3 · Accepted Answer · answered Sep 20 '22 at 14:30

The main problem seems to be the way you use the two nested loops. Here's a very explicit example, without anything fancy, showing how you could do this:

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

$b = [];
foreach($a as $outerName) {
    // start optimistic, no similar string found
    $isUnique = true;
    foreach($b as $innerName) {
        // check whether the string already has a similar entry
        similar_text($outerName, $innerName, $percent);
        if ($percent > 90) {
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) {
        $b[] = $outerName;
    }
}

print_r($b);

The output is:

Array
(
    [0] => paris
    [1] => london
    [2] => london tour
)

How does it work? The outer loop goes through every string in array $a. For each one, the inner loop goes through the strings in $b, which are the ones already accepted as unique enough. If the string from $a is more than 90% similar to any string in $b, it is skipped; otherwise it is appended to $b. That's all.
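
The same logic can be wrapped in a reusable function. This is only a sketch under a few assumptions: the name `filterSimilar()` and its `$threshold` parameter are hypothetical, and the comparison is the same `similar_text()` check as above.

```php
<?php
/**
 * Hypothetical helper: keeps the first occurrence of each group of similar
 * strings, using similar_text() and a percentage threshold.
 */
function filterSimilar(array $items, float $threshold = 90.0): array
{
    $kept = [];
    foreach ($items as $candidate) {
        $isUnique = true;
        foreach ($kept as $existing) {
            // $percent is filled in by reference
            similar_text($candidate, $existing, $percent);
            if ($percent > $threshold) {
                $isUnique = false;
                break;
            }
        }
        if ($isUnique) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

$a = ['paris', 'london', 'paris', 'london tour', 'london tours', 'london', 'londonn'];
print_r(filterSimilar($a));                // paris, london, london tour
print_r(filterSimilar(array_reverse($a))); // londonn, london tours, paris
```

The second call shows that the result depends on input order: whichever variant appears first is the one that is kept. If the correctly spelled or most popular search term should win, sorting the input accordingly before filtering is one option.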

Thank you very much. Much appreciated. Your code works perfectly. I will take some time to study it. — wp-ap, Sep 20 '22 at 14:34

Brian · Answer 2 · 2022-09-20T14:28:49.380

You can use the $percent value that similar_text() sets through its third (by-reference) argument. It holds the similarity between the two inputs as a percentage.

For a word game I implemented, I used this approach. To 'match' the word(s), testing for a percentage of at least 60 to 80 seemed to work for most of my test cases; it depends on how picky you want to be!

For my case, to get it accurate, I actually converted the test words to metaphones first:

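// Returns true when the metaphone keys of the two strings are at least 85% similar.
// In my project this was a public static class method; it is shown here as a
// plain function so that the usage line below works as written.
function testMetaphone($s1 = "", $s2 = "", $phonemes = 4)
{
    if (empty($s1) || empty($s2)) {
        return false;
    }

    // Compare the phonetic keys instead of the raw strings
    $m1 = metaphone($s1, $phonemes);
    $m2 = metaphone($s2, $phonemes);
    // similar_text() returns the number of matching characters and
    // stores the percentage in $perc
    $sim = similar_text($m1, $m2, $perc);
    $logMessage = "M1: {$m1}, M2: {$m2}, Similarity: $sim ($perc %) - Original text: {$s1} | {$s2}";
    // The original logged through Log::info(), which needs a framework logger;
    // error_log() is used here so the snippet runs standalone
    error_log("testMetaphone: " . $logMessage);
    // Test accuracy
    return $perc >= 85;
}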

Usage:

$answerCheck = testMetaphone("Toyota", "Totota", 6);

See it in action: https://3v4l.org/KceXD. The example above fails with a threshold of 85% but passes with 60%, so you may need to experiment to find the threshold where you are happy with its accuracy.

For your case, you can loop over the array, compare each element with every other element using this function, keep track of which words have been checked and how many similar entries each one has, and then delete the 'duplicates' accordingly.
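
A minimal sketch of such a loop follows. It is a simplified variant rather than the exact bookkeeping described above: it keeps the first entry of each group instead of counting similar entries. The helper names `soundsSimilar()` and `dedupeBy()` and the default threshold are assumptions made for illustration; `metaphone()` and `similar_text()` are PHP built-ins.

```php
<?php
// Hypothetical predicate: true when the metaphone keys of two strings are
// at least $threshold percent similar according to similar_text().
function soundsSimilar(string $s1, string $s2, float $threshold = 85.0): bool
{
    if ($s1 === '' || $s2 === '') {
        return false;
    }
    similar_text(metaphone($s1), metaphone($s2), $percent);
    return $percent >= $threshold;
}

// Hypothetical helper: keeps the first entry of each group of entries that
// $isSimilar considers duplicates of one another.
function dedupeBy(array $items, callable $isSimilar): array
{
    $kept = [];
    foreach ($items as $candidate) {
        foreach ($kept as $existing) {
            if ($isSimilar($candidate, $existing)) {
                continue 2; // a similar entry is already kept, skip this one
            }
        }
        $kept[] = $candidate;
    }
    return $kept;
}

$a = ['paris', 'london', 'paris', 'london tour', 'london tours', 'london', 'londonn'];
print_r(dedupeBy($a, 'soundsSimilar'));
```

Because phonetic keys are coarser than the raw strings, this may merge more (or different) entries than a plain similar_text() comparison, so the threshold is worth testing against real search terms.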
