
I'm trying to create a method/function that compares two sentences and returns their similarity as a percentage.

For example, PHP has a function called similar_text, but it doesn't work well for this.

Here are a few example sentences that should get a high similarity when compared against each other:

In the backyard there is a green tree and the sun is shinnying.
The sun is shinnying in the backyard and there is a green tree too.
A yellow tree is in the backyard with a shinnying sun.
In the front yard there is a green tree and the sun is shinnying.
In the front yard there is a red tree and the sun is no shinnying.

Does anyone know a good way to do this?

I would prefer to use PHP, but I don't mind using Java or Python.

I found this function on the internet:

function compareStrings($s1, $s2) {
    //one is empty, so no result
    if (strlen($s1)==0 || strlen($s2)==0) {
        return 0;
    }

    //replace non-alphabetic characters with spaces (digits are removed too)
    //I left - in case it's used to combine words
    $s1clean = preg_replace("/[^A-Za-z-]/", ' ', $s1);
    $s2clean = preg_replace("/[^A-Za-z-]/", ' ', $s2);

    //remove double spaces
    $s1clean = str_replace("  ", " ", $s1clean);
    $s2clean = str_replace("  ", " ", $s2clean);

    //create arrays
    $ar1 = explode(" ",$s1clean);
    $ar2 = explode(" ",$s2clean);
    $l1 = count($ar1);
    $l2 = count($ar2);

    //flip the arrays if needed so ar1 is always largest.
    if ($l2>$l1) {
        $t = $ar2;
        $ar2 = $ar1;
        $ar1 = $t;
    }

    //flip array 2, to make the words the keys
    $ar2 = array_flip($ar2);


    $maxwords = max($l1, $l2);
    $matches = 0;

    //find matching words
    foreach($ar1 as $word) {
        if (array_key_exists($word, $ar2))
            $matches++;
    }

    return ($matches / $maxwords) * 100;    
}

But it only returns 80%, and similar_text returns just 39%.
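
The 80% appears to come from two details of the function above rather than from the sentences themselves, assuming the first two example sentences were the ones compared: the comparison is case-sensitive, so "The"/"the" and "In"/"in" do not match, and the trailing period becomes an empty string that is counted as a word, which gives 12 matches out of 15. As a rough sketch only (in Python, since that is acceptable here), the variant below lowercases the input, drops empty tokens, and counts repeated words as a multiset. The names `tokenize` and `word_overlap` are made up for illustration, and this is not a port of the PHP function.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and return its words (letters and hyphens only)."""
    return re.findall(r"[a-z-]+", sentence.lower())


def word_overlap(s1, s2):
    """Percentage of words shared by two sentences, ignoring order and case."""
    words1, words2 = tokenize(s1), tokenize(s2)
    if not words1 or not words2:
        return 0.0
    counts1, counts2 = Counter(words1), Counter(words2)
    # Multiset intersection: a repeated word only matches as often as it
    # occurs in both sentences.
    shared = sum((counts1 & counts2).values())
    # Divide by the longer sentence, as the PHP function does.
    return 100.0 * shared / max(len(words1), len(words2))


if __name__ == "__main__":
    first = "In the backyard there is a green tree and the sun is shinnying."
    second = "The sun is shinnying in the backyard and there is a green tree too."
    print(round(word_overlap(first, second), 1))  # 92.9
```

With the first two example sentences this gives about 92.9% (13 shared words out of 14). It still measures word overlap only, not meaning.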

Mr.Tr33
  • It is hard to compute similarity when you haven't *defined* similarity. You should describe what you are trying to do and explain *why* `similar_text` doesn't do what you want it to do. – John Coleman May 06 '16 at 14:10
  • For example, take the first sentence from my examples above. I read a text and find a sentence that is close to it, and I want a percentage for how similar the two are. When I compare the first two sentences with similar_text, it returns 39%, but the words are just in different positions and the second one has one more word. – Mr.Tr33 May 06 '16 at 14:20
  • `similar_text()` is the best bet you've got. If you've got plenty of processing power, why not put each **individual word** into an array and very **expensively** compare the arrays to each other? –  May 06 '16 at 14:21
  • @Mr.Tr33 You are just repeating yourself. What do you *mean* by "percentage of how similar they are"? You haven't defined what you mean by similarity, but have just informed us that it differs from the meaning of "similarity" used in `similar_text()`. You have provided an anti-specification (do something different from `similar_text()`) rather than a specification (do *this*). – John Coleman May 06 '16 at 14:26
  • Read this SO post: http://stackoverflow.com/questions/5351659/algorithms-for-string-similarities-better-than-levenshtein-and-similar-text – Martin May 06 '16 at 14:30
  • Okay, let me try it differently. By similarity I mean that two sentences count as equal if they have almost the same meaning. It's okay if the words are mixed around, a number is changed, or the meaning is changed, e.g. from positive to negative with a "not". I added a function that does almost what I want, but it's not "great", just good. And sorry for my phrasing. It's hard for me to explain what I'm thinking in English. – Mr.Tr33 May 06 '16 at 14:31
