I'm trying to create a method/function that compares two sentences and returns their similarity as a percentage.
For example, PHP has a function called similar_text, but it doesn't work well for this.
Here are a few example sentences that should get a high similarity score when compared against each other:
In the backyard there is a green tree and the sun is shinnying.
The sun is shinnying in the backyard and there is a green tree too.
A yellow tree is in the backyard with a shinnying sun.
In the front yard there is a green tree and the sun is shinnying.
In the front yard there is a red tree and the sun is no shinnying.
Does anyone know a good approach or example for this?
I would prefer to use PHP, but I don't mind using Java or Python.
I found this function on the internet:
function compareStrings($s1, $s2) {
    // one is empty, so no result
    if (strlen($s1)==0 || strlen($s2)==0) {
        return 0;
    }
    // replace non-letter characters with spaces
    // I left - in case it's used to combine words
    $s1clean = preg_replace("/[^A-Za-z-]/", ' ', $s1);
    $s2clean = preg_replace("/[^A-Za-z-]/", ' ', $s2);
    // remove double spaces
    $s1clean = str_replace("  ", " ", $s1clean);
    $s2clean = str_replace("  ", " ", $s2clean);
    // create arrays
    $ar1 = explode(" ", $s1clean);
    $ar2 = explode(" ", $s2clean);
    $l1 = count($ar1);
    $l2 = count($ar2);
    // swap the arrays if needed so ar1 is always the largest
    if ($l2 > $l1) {
        $t = $ar2;
        $ar2 = $ar1;
        $ar1 = $t;
    }
    // flip array 2 to make the words the keys
    $ar2 = array_flip($ar2);
    $maxwords = max($l1, $l2);
    $matches = 0;
    // find matching words (case-sensitive)
    foreach ($ar1 as $word) {
        if (array_key_exists($word, $ar2)) {
            $matches++;
        }
    }
    return ($matches / $maxwords) * 100;
}
But it only returns 80%, and similar_text returns just 39%.
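Since Python would also be fine, here is a minimal sketch of the same word-overlap idea, under two assumptions that are not part of the PHP function above: words are lowercased and de-duplicated into sets, and the score is the Jaccard index (shared words divided by all distinct words). The name word_similarity is hypothetical, and this is only one possible way to score the sentences, not a definitive solution.

```python
import re


def word_similarity(s1: str, s2: str) -> float:
    """Return the Jaccard overlap of two sentences' word sets as a percentage."""
    # Lowercase, then keep runs of letters and hyphens (same character
    # class as the PHP regex, but case-insensitive by assumption).
    words1 = set(re.findall(r"[a-z-]+", s1.lower()))
    words2 = set(re.findall(r"[a-z-]+", s2.lower()))
    # An empty sentence has nothing to compare, so report no similarity.
    if not words1 or not words2:
        return 0.0
    shared = words1 & words2
    return len(shared) / len(words1 | words2) * 100


a = "In the backyard there is a green tree and the sun is shinnying."
b = "The sun is shinnying in the backyard and there is a green tree too."
print(round(word_similarity(a, b), 1))  # 91.7
```

With these assumptions the first two example sentences share 11 of 12 distinct words, because "In"/"in" and "The"/"the" now match and the trailing period no longer produces an empty word.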