22

Is there any function in PHP that checks the percentage of similarity between two strings?

For example, I have:

$string1 = "Hello how are you doing";
$string2 = " hi, how are you";

and function($string1, $string2) should return true, because the words "how", "are" and "you" are present in both strings.

Or, even better, it should return 60% similarity, because "how", "are" and "you" make up 3/5 of $string1.

Does any PHP function exist that does that?

Fabio
Ilya Libin
  • Be aware, though, that "similar" may mean different things. – Prof. Falken May 13 '13 at 11:18
  • Please define "similarity". Does it relate to single characters, to words, or to phrases? I don't think `similar_text` will do the job. – enenen May 13 '13 at 11:18
  • The similar_text function does something like that, but read http://stackoverflow.com/questions/14136349/how-does-similar-text-work to see how it actually works. It might not do what you expect. If you want the percentage of matching words, I would suggest a custom method that uses some sort of explode on a cleaned string. – Hugo Delsing May 13 '13 at 11:19
  • @HugoDelsing Yes, I actually need the similarity of words, not just of single characters. – Ilya Libin May 13 '13 at 11:24
  • @enenen Yes, I mean similar words, not single characters. – Ilya Libin May 13 '13 at 11:25

6 Answers

37

As it's a nice question, I put some effort into it:

<?php
$string1="Hello how are you doing";
$string2= " hi, how are you";

echo 'Compare result: ' . compareStrings($string1, $string2) . '%';
//60%


function compareStrings($s1, $s2) {
    //one is empty, so no result
    if (strlen($s1)==0 || strlen($s2)==0) {
        return 0;
    }

    //replace non-alphanumeric characters with spaces
    //the hyphen is kept in case it is used to combine words
    $s1clean = preg_replace("/[^A-Za-z0-9-]/", ' ', $s1);
    $s2clean = preg_replace("/[^A-Za-z0-9-]/", ' ', $s2);

    //remove double spaces
    while (strpos($s1clean, "  ")!==false) {
        $s1clean = str_replace("  ", " ", $s1clean);
    }
    while (strpos($s2clean, "  ")!==false) {
        $s2clean = str_replace("  ", " ", $s2clean);
    }

    //create arrays
    $ar1 = explode(" ",$s1clean);
    $ar2 = explode(" ",$s2clean);
    $l1 = count($ar1);
    $l2 = count($ar2);

    //flip the arrays if needed so ar1 is always largest.
    if ($l2>$l1) {
        $t = $ar2;
        $ar2 = $ar1;
        $ar1 = $t;
    }

    //flip array 2, to make the words the keys
    $ar2 = array_flip($ar2);


    $maxwords = max($l1, $l2);
    $matches = 0;

    //find matching words
    foreach($ar1 as $word) {
        if (array_key_exists($word, $ar2))
            $matches++;
    }

    return ($matches / $maxwords) * 100;    
}
?>
Hugo Delsing
  • Finally an answer without the useless (in this case) `similar_text`. +1 – enenen May 13 '13 at 11:49
  • Wow! Thanks for the brilliant answer! The only problem is that I use strings in different languages, like Japanese, Spanish and Russian. There is another way to make it more interesting and complicated: for example, you could give additional similarity points if the words come in the same order. "Hello how are you" is OK, but "Hello you how are" is less good. – Ilya Libin May 13 '13 at 12:27
  • Also, similar_text can make it more forgiving of mistakes. For example, if I write "he walking on the street" and "he walk on the street", it will still be OK. – Ilya Libin May 13 '13 at 12:33
  • Both methods have advantages and downsides. But keep in mind that you asked for similar words. Test `$string1="words super cool"; $string2="super cool words";`: mine gives a 100% match and similar_text only 62%. It's a matter of what you want to check and how. similar_text looks for the longest parts that match in a string, so the order of the words and letters matters too, and that order changes a lot when two or more people try to say the same thing in their own words. – Hugo Delsing May 13 '13 at 12:46
  • Excellent solution. Nothing like comparing two exact strings and `similar_text()` producing a `33%` chance of a match. – Tim Hallman Feb 28 '19 at 15:58
  • I would suggest using ```$s1clean = preg_replace('!\s+!', ' ',$s1clean);``` to replace multiple spaces, line breaks and tabs with a single space. – Sadee Sep 22 '21 at 09:45
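Following up on the comments about non-English strings and whitespace handling, here is a minimal sketch of a Unicode-aware word comparison. The function name `wordOverlapPercent` is hypothetical and not part of the answer above; the sketch assumes UTF-8 input and the mbstring extension, and it ignores word order just like the answer does.

```php
<?php
// Hypothetical helper (not from the answer above): percentage of shared
// words, using Unicode-aware splitting and case folding.
// Assumes UTF-8 input and the mbstring extension.
function wordOverlapPercent(string $s1, string $s2): float
{
    // Split on any run of characters that are not letters, digits or hyphens.
    $split = function (string $s): array {
        $words = preg_split(
            '/[^\p{L}\p{N}-]+/u',
            mb_strtolower($s, 'UTF-8'),
            -1,
            PREG_SPLIT_NO_EMPTY
        );
        return $words === false ? [] : $words; // false means invalid UTF-8
    };

    $w1 = $split($s1);
    $w2 = $split($s2);
    if (!$w1 || !$w2) {
        return 0.0;
    }

    // Count the words of the first list that also occur in the second list.
    $lookup = array_flip($w2);
    $matches = 0;
    foreach ($w1 as $word) {
        if (isset($lookup[$word])) {
            $matches++;
        }
    }

    return $matches / max(count($w1), count($w2)) * 100;
}

echo wordOverlapPercent("Hello how are you doing", " hi, how are you"), "\n";
echo wordOverlapPercent("Сколько это стоит", "сколько стоит это"), "\n";
```

Note that this only helps for languages that separate words with spaces or punctuation. Japanese text written without spaces would come out as one long "word", so it would need a real tokenizer.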
11

As other answers have already said, you can use similar_text. Here's the demonstration:

$string1="Hello how are you doing" ;
$string2= " hi, how are you";

echo similar_text($string1, $string2, $perc); //12

echo $perc; //61.538461538462

This returns 12 and stores the percentage of similarity in $perc, as you asked.
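As far as I can tell, the percentage is simply twice the number of matching characters divided by the combined length of both strings. The sketch below recomputes it by hand to show where the 61.5% comes from; the variable `$manual` is only there for illustration.

```php
<?php
$string1 = "Hello how are you doing";
$string2 = " hi, how are you";

// Number of matching characters; the percentage is written to $perc.
$sim = similar_text($string1, $string2, $perc);

// The same value computed by hand: twice the matching characters,
// divided by the combined length of both strings.
$manual = $sim * 2 / (strlen($string1) + strlen($string2)) * 100;

printf("%d matching characters, %.2f%% vs %.2f%%\n", $sim, $perc, $manual);
```

So the figure is character-based: it happens to land near the 60% from the question, but it is not a percentage of matching words.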

Ankit
Alex Siri
11

In addition to Alex Siri's answer and according to the following article:

http://docstore.mik.ua/orelly/webprog/php/ch04_06.htm

PHP provides several functions that let you test whether two strings are approximately equal:

$string1="Hello how are you doing" ;
$string2= " hi, how are you";

SOUNDEX

if (soundex($string1) == soundex($string2)) {
  echo "similar";
} else {
  echo "not similar";
}

METAPHONE

if (metaphone($string1) == metaphone($string2)) {
  echo "similar";
} else {
  echo "not similar";
}

SIMILAR TEXT

$similarity = similar_text($string1, $string2);

LEVENSHTEIN

$distance = levenshtein($string1, $string2); 
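
Keep in mind that levenshtein returns a distance (the number of single-character edits), so a smaller value means more similar. If you want something closer to a percentage, one possible normalization is sketched below. The helper name `levenshteinPercent` is made up for this example, and dividing by the longer length is just one convention among several.

```php
<?php
// Hypothetical helper: turn an edit distance into a rough percentage.
// Lengths are measured in bytes, as levenshtein() itself works on bytes.
function levenshteinPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    // The distance can never exceed the longer length, so this stays in 0..100.
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

printf("%.1f%%\n", levenshteinPercent("kitten", "sitting")); // distance 3 of 7
```

Like similar_text, this compares characters rather than words, so reordered words will score low.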
Hugo Delsing
RafaSashi
0

OK, here is my function, which makes it more interesting.

I'm checking the approximate similarity of strings.

Here are the criteria I use:

  1. The order of the words is important.
  2. Words count as matching if they are at least 85% similar.

Example:

$string1 = "How much will it cost to me";  // string in the vocabulary
$string2 = "How much does costs it ";      // user input ("costs" instead of "cost" is a mistake)

Algorithm:

  1. Check the similarity of the words and create a clean string with the "right" words, in the order they appear in the vocabulary. OUTPUT: "how much it cost"
  2. Create a clean string with the "right" words, in the order they appear in the user input. OUTPUT: "how much cost it"
  3. Compare the two outputs: if they are not the same, return no; if they are the same, return yes.

error_reporting(E_ALL);
ini_set('display_errors', true);

$string1="сколько это стоит ваще" ;
$string2= "сколько будет стоить это будет мне";

if (compareStrings($string1, $string2)) {
    echo "yes";
} else {
    echo "no";
}
//echo compareStrings($string1, $string2);

function compareStrings($s1, $s2) {

    if (strlen($s1)==0 || strlen($s2)==0) {
        return 0;
    }

    while (strpos($s1, "  ")!==false) {
        $s1 = str_replace("  ", " ", $s1);
    }
    while (strpos($s2, "  ")!==false) {
        $s2 = str_replace("  ", " ", $s2);
    }

    $ar1 = explode(" ",$s1);
    $ar2 = explode(" ",$s2);
  //  $array1 = array_flip($ar1);
  //  $array2 = array_flip($ar2);
    $l1 = count($ar1);
    $l2 = count($ar2);

    // Step 1: matched vocabulary words, in vocabulary order.
    $meaning = "";
    foreach ($ar1 as $w1) {
        foreach ($ar2 as $w2) {
            similar_text($w1, $w2, $percent);
            if ($percent >= 85) {
                $meaning .= " " . $w1;
                break;
            }
        }
    }

    // Step 2: the same vocabulary words, in the order of the user input.
    $rightorder = "";
    foreach ($ar2 as $w2) {
        foreach ($ar1 as $w1) {
            similar_text($w1, $w2, $percent);
            if ($percent >= 85) {
                $rightorder .= " " . $w1;
                break;
            }
        }
    }

    // Step 3: same words in the same order means "yes".
    return $rightorder == $meaning;

}

I would love to hear your opinions and suggestions on how to improve it.

Tony Stark
Ilya Libin
  • Long time ago, but just read this answer. If I enter two completely different strings it returns true, because `$rightorder` and `$meaning` both stay an empty string. – Hugo Delsing Sep 16 '13 at 20:45
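
One way to deal with the empty case from the comment above is to require at least one matched word before comparing the order. This is only a sketch of that idea, not the answer's code: the name `compareWordOrder` and the `$threshold` parameter are made up here, and the 85% cut-off is taken from the criteria in the answer.

```php
<?php
// Hypothetical variant: same order check, but when no words match at all
// the result is "not similar" instead of "similar".
function compareWordOrder(string $vocab, string $input, float $threshold = 85.0): bool
{
    // Split on whitespace; ?: [] covers the false returned for invalid UTF-8.
    $v = preg_split('/\s+/u', $vocab, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $u = preg_split('/\s+/u', $input, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Matched vocabulary words, in vocabulary order.
    $inVocabOrder = [];
    foreach ($v as $vw) {
        foreach ($u as $uw) {
            similar_text($vw, $uw, $percent);
            if ($percent >= $threshold) {
                $inVocabOrder[] = $vw;
                break;
            }
        }
    }

    // The same vocabulary words, in the order they appear in the input.
    $inInputOrder = [];
    foreach ($u as $uw) {
        foreach ($v as $vw) {
            similar_text($vw, $uw, $percent);
            if ($percent >= $threshold) {
                $inInputOrder[] = $vw;
                break;
            }
        }
    }

    // At least one match is required, and the order must be identical.
    return $inVocabOrder !== [] && $inVocabOrder === $inInputOrder;
}

var_dump(compareWordOrder("abc def", "xyz qrs"));
var_dump(compareWordOrder("how much will it cost", "how much it costs"));
```

Whether a single matched word should be enough is a separate design decision; a minimum share of matched words might be a better rule.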
0

You can use the PHP function similar_text.

int similar_text ( string $first , string $second [, float &$percent ] )

Check the PHP doc at: http://php.net/manual/en/function.similar-text.php

Salvi Pascual
0

This question is quite old, but I am adding my solution for a few reasons. First, according to his comment, the author wanted to compare similar words rather than whole strings. Second, most of the answers try to solve it with similar_text, which is not suitable for this problem: it compares the text character by character, so quite different strings can still come out as a match. The first answer, by @Hugo Delsing, uses array_flip, which swaps keys and values, so a word that occurs more than once is counted only once. The answer below compares words instead. Its one limitation is that it takes little account of the order of the words.

function compareStrings($s1, $s2)
{
    if (strlen($s1) == 0 || strlen($s2) == 0) {
        return 0;
    }

    // -1 means "no limit"; passing null for the limit is deprecated as of PHP 8.1
    $ar1 = preg_split('/[^\w\-]+/', strtolower($s1), -1, PREG_SPLIT_NO_EMPTY);
    $ar2 = preg_split('/[^\w\-]+/', strtolower($s2), -1, PREG_SPLIT_NO_EMPTY);

    $l1 = count($ar1);
    $l2 = count($ar2);

    $ar2_copy = array_values($ar2);

    $matched_indices = [];
    $word_map = [];
    foreach ($ar1 as $k => $w1) {
        if (isset($word_map[$w1])) {
            if ($word_map[$w1][0] >= $k) {
                $matched_indices[$k] = $word_map[$w1][0];
            }
            array_splice($word_map[$w1], 0, 1);
        } else {
            $indices = array_keys($ar2_copy, $w1);
            $index_count = count($indices);
            if ($index_count) {
                if ($index_count == 1) {
                    $matched_indices[$k] = $indices[0];
                    // remove the word at given index from second array so that it won't repeat again
                    unset($ar2_copy[$indices[0]]);
                } else {
                    $matched_indices[$k] = $indices[0];
                    // remove the word at given indices from second array so that it won't repeat again
                    foreach ($indices as $index) {
                        unset($ar2_copy[$index]);
                    }
                    array_splice($indices, 0, 1);
                    $word_map[$w1] = $indices;
                }
            }
        }
    }
    return round(count($matched_indices) * 100 / $l1, 2);
}
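
The array_flip point mentioned above is easy to check in isolation. This small sketch uses a made-up word list and compares it with array_count_values, which keeps the number of occurrences instead.

```php
<?php
$words = ['how', 'are', 'you', 'how'];

// array_flip() keeps one entry per distinct word (the last index wins),
// so the second "how" overwrites the first.
$flipped = array_flip($words);

// array_count_values() records how often each word occurs.
$counts = array_count_values($words);

print_r($flipped); // three keys: how, are, you
print_r($counts);  // "how" is counted twice
```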
Raheel Shahzad