I am writing a small PHP script that computes how similar two strings are, as a percentage.

Here is my code. If you swap the order of the two variables, as in the second call below, the result is different.

<?php 
$var_1 = 'PHP IS GREAT'; 
$var_2 = 'WITH MYSQL'; 

$var_1 = trim(strtolower ( $var_1 ));
$var_2 = trim(strtolower ( $var_2 ));

similar_text($var_1, $var_2, $percent); 

echo $percent; 
// 27.272727272727 

similar_text($var_2, $var_1, $percent); 

echo $percent; 
// 18.181818181818 
?>

Can someone suggest a better PHP function, or explain why the two results differ?

Jonathan Besomi

3 Answers

Use levenshtein():

$var_1 = 'PHP IS GREAT';
$var_2 = 'WITH MYSQL';
var_dump(levenshtein($var_1, $var_2));
var_dump(levenshtein($var_2, $var_1));

Output:

int(11)
int(11)
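
Since the question asks for a percentage, here is a minimal sketch of one way to turn the distance into a score. The helper name levenshteinPercent and the normalisation by the longer string's length are assumptions for illustration, not something levenshtein() provides itself.

```php
<?php
// Hypothetical helper: maps the Levenshtein distance to a 0-100 score.
// Dividing by the longer length is one possible normalisation (an assumption).
function levenshteinPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

$var_1 = 'php is great';
$var_2 = 'with mysql';

// With the default costs the distance is symmetric, so the score is too.
var_dump(levenshteinPercent($var_1, $var_2) === levenshteinPercent($var_2, $var_1)); // bool(true)
```

Unlike similar_text(), this score does not depend on the argument order, but it measures edit distance rather than shared substrings, so the two percentages are not directly comparable.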
hek2mgl

It would indeed seem that the function uses different logic depending on the parameter order. I think there are two things at play.

First, see this example:

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems to be testing "how many times any distinct char in param1 is found in param2", so the result differs if you swap the params around. It has been reported as a bug, but nobody has confirmed it.

Now, the above holds for both the PHP and JavaScript implementations: parameter order has an impact, so saying that the JS code wouldn't do this is wrong. It is possible to argue that this is intended behaviour, though I am not sure it is.

Second, what doesn't seem correct is the MYSQL/PHP example. There, the JavaScript version gives 3 irrespective of the order of the params, whereas PHP gives 2 and 3 (and because of that, the percentage differs too). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" have 5 characters in common, irrespective of which way you compare: H, I, S and T, one each, plus one space. In order, they have 3 characters in common, 'H', ' ' and 'S', so if you take ordering into account, the correct answer should be 3 both ways.
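
The order-independent count of 5 can be checked with a small sketch. The helper name commonCharCount is hypothetical; it counts shared bytes with count_chars() and ignores position entirely, so it is not what similar_text() computes.

```php
<?php
// Hypothetical helper: counts characters shared by two strings,
// ignoring order (a multiset intersection of their bytes).
function commonCharCount(string $a, string $b): int
{
    $countsA = count_chars($a, 1); // byte value => occurrences in $a
    $countsB = count_chars($b, 1); // byte value => occurrences in $b
    $total = 0;
    foreach ($countsA as $byte => $n) {
        if (isset($countsB[$byte])) {
            // A character counts only as often as it appears in both strings.
            $total += min($n, $countsB[$byte]);
        }
    }
    return $total;
}

var_dump(commonCharCount('PHP IS GREAT', 'WITH MYSQL')); // int(5)
var_dump(commonCharCount('WITH MYSQL', 'PHP IS GREAT')); // int(5)
```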

As a result, use levenshtein().

sergio

The calculation in similar_text() is:

similar_chars * 200 / (t1_length + t2_length)

Case 1: 3 * 200 / (12 + 10) = 27.27

Case 2: 2 * 200 / (10 + 12) = 18.18
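
As a quick check, the formula can be applied by hand to the value similar_text() returns (the number of matching characters) and compared with the percentage it reports. The variable names below are only for illustration.

```php
<?php
$var_1 = 'php is great';
$var_2 = 'with mysql';

// similar_text() returns the number of matching characters
// and writes the percentage into its third argument.
$sim_1 = similar_text($var_1, $var_2, $percent_1);
$sim_2 = similar_text($var_2, $var_1, $percent_2);

// Applying the formula by hand.
$manual_1 = $sim_1 * 200 / (strlen($var_1) + strlen($var_2));
$manual_2 = $sim_2 * 200 / (strlen($var_2) + strlen($var_1));

echo $sim_1, ' ', round($manual_1, 2), "\n"; // 3 27.27
echo $sim_2, ' ', round($manual_2, 2), "\n"; // 2 18.18
```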

Why Case 1?

Step 1: PHP_IS_GREAT <-> WITH_MYSQL => H is found; everything up to and including this character is removed from both strings.

Step 2: P_IS_GREAT <-> _MYSQL => The whitespace is found; everything up to and including it is removed from both strings.

Step 3: IS_GREAT <-> MYSQL => S is found; everything up to and including it is removed from both strings.

Step 4: _GREAT <-> QL => Nothing else is found. Result = 3.

Why Case 2?

Step 1: WITH_MYSQL <-> PHP_IS_GREAT => I is found; everything up to and including this character is removed from both strings.

Step 2: TH_MYSQL <-> S_GREAT => T is found; everything up to and including it is removed from both strings.

Step 3: H_MYSQL <-> nothing left to compare => Nothing else is found. Result = 2.

Found here: How does similar_text work?

I have written it up here in a slightly shorter form.
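
The steps above can be sketched in plain PHP. This is a minimal reimplementation of the idea for illustration, not PHP's actual C source, and the function names firstLongestCommon and similarSketch are hypothetical. It assumes the function takes the first longest common substring and then recurses on the parts to its left and to its right; in the two cases above the left parts share no characters, which is why they look as if they were simply deleted.

```php
<?php
// Find the first longest common substring of $a and $b.
// Returns [position in $a, position in $b, length].
function firstLongestCommon(string $a, string $b): array
{
    $best = [0, 0, 0];
    $lenA = strlen($a);
    $lenB = strlen($b);
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $k = 0;
            while ($i + $k < $lenA && $j + $k < $lenB && $a[$i + $k] === $b[$j + $k]) {
                $k++;
            }
            // Strictly greater: the first match of a given length wins.
            // The outer loop walks $a, so swapping arguments can pick a different match.
            if ($k > $best[2]) {
                $best = [$i, $j, $k];
            }
        }
    }
    return $best;
}

// Count matching characters: the common substring plus whatever
// matches to the left of it and to the right of it.
function similarSketch(string $a, string $b): int
{
    [$posA, $posB, $len] = firstLongestCommon($a, $b);
    if ($len === 0) {
        return 0;
    }
    return $len
        + similarSketch(substr($a, 0, $posA), substr($b, 0, $posB))
        + similarSketch(substr($a, $posA + $len), substr($b, $posB + $len));
}

var_dump(similarSketch('php is great', 'with mysql')); // int(3)
var_dump(similarSketch('with mysql', 'php is great')); // int(2)
```

Because ties between equally long matches are broken by scan order over the first argument, the first call anchors on H and the second on I, and the remaining pieces then line up differently.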

Lustknylch