54

I just found the similar_text function and was playing around with it, but the percentage output always suprises me. See the examples below.

I tried to find information on the algorithm used as mentioned on php: similar_text()Docs:

<?php
$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match

similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//40
//5 out of 20 > not 25% ?

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//9.5238095238095 
//5 out of 100 > not 5% ?


//Example from PHP.net
//Why is turning the strings around changing the result?

similar_text('PHP IS GREAT', 'WITH MYSQL', $p);
echo $p . "<hr>"; //27.272727272727

similar_text('WITH MYSQL', 'PHP IS GREAT', $p);
echo $p . "<hr>"; //18.181818181818

?>

Can anybody explain how this actually works?

Update:

Thanks to the comments I found that the percentage is actually calculated using the number of similar charactors * 200 / length1 + lenght 2

Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);

So that explains why the percenatges are higher then expected. With a string with 5 out of 95 it turns out 10, so that I can use.

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10

But I still cant figure out why PHP returns a different result on turning the strings around. The JS code provided by dfsq doesn't do this. Looking at the source code in PHP I can only find a difference in the following line, but i'm not a c programmer. Some insight in what the difference is, would be appreciated.

In JS:

for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);

In PHP: (php_similar_str function)

for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);

Source:

/* {{{ proto int similar_text(string str1, string str2 [, float percent])
   Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
  char *t1, *t2;
  zval **percent = NULL;
  int ac = ZEND_NUM_ARGS();
  int sim;
  int t1_len, t2_len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
    return;
  }

  if (ac > 2) {
    convert_to_double_ex(percent);
  }

  if (t1_len + t2_len == 0) {
    if (ac > 2) {
      Z_DVAL_PP(percent) = 0;
    }

    RETURN_LONG(0);
  }

  sim = php_similar_char(t1, t1_len, t2, t2_len);

  if (ac > 2) {
    Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
  }

  RETURN_LONG(sim);
}
/* }}} */ 


/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */

Source in Javascript: similar text port to javascript

hakre
  • 193,403
  • 52
  • 435
  • 836
Hugo Delsing
  • 13,803
  • 5
  • 45
  • 72
  • 1
    [Ian Oliver, Programming classics: implementing the world's best algorithms](http://books.google.de/books/about/Programming_classics.html?id=_oFQAAAAMAAJ&redir_esc=y), also [available on Amazon](http://www.amazon.com/Programming-Classics-Implementing-Worlds-Algorithms/dp/0131004131) – Gordon Jan 03 '13 at 09:49
  • 3
    also see http://stackoverflow.com/questions/3084608/what-is-the-paper-oliver-1993-describing-a-php-algorithm-to-calculate-text-s – Nanne Jan 03 '13 at 09:51
  • I found the book, but was hoping to find information without buying it – Hugo Delsing Jan 03 '13 at 09:53
  • See the port of this function to javascript: http://phpjs.org/functions/similar_text/ This uses probably the same algorithm (not sure). – dfsq Jan 03 '13 at 09:55
  • Found some implementations: http://stackoverflow.com/questions/2000440/php-similar-text-in-java And as it says in there, browse the PHP source code to sind it :). – Bart Friederichs Jan 03 '13 at 09:56
  • dfsq > Thanks, although the JS functions produces the same output on the last two examples (~27%). But it sure helps – Hugo Delsing Jan 03 '13 at 10:00
  • 2
    You should check PHP source codes then for original implementation. – dfsq Jan 03 '13 at 10:02
  • The fact that you're using the same character repeatedly is likely making a big difference to the result. I don't know the algorithm itself, but it clearly does more analysis on the string than simply looking at the length of the strings. If it was a simple comparison, it probably wouldn't be a built-in function since you could writ it easily enough yourself. Also, you're checking the % value given in the 3rd param, but you should also see the function return value, which (according to the doc) is the number of chars changed. This may give a value closer to the figure you're expecting. – SDC Jan 03 '13 at 10:14
  • @HugoDelsing Think I tackled all the questions in this thread. Please check my answer and offer any feedback if needed. – Khez Jan 09 '13 at 12:33
  • Your response did look promising, so I awarded you a +1 for the effort. I didnt have time to look into it just yet, because at a first glance a few things look strange (mostly in the wert/test part) – Hugo Delsing Jan 09 '13 at 13:45
  • 1
    @HugoDelsing want me to go further into how the function iterates to find the result ? I figured going top-down on the answer was rather clear, I'm sorry if I was mistaken. – Khez Jan 09 '13 at 22:00
  • I would suggest using edit-distance (guaranteed symmetrical, well-known DP algorithm to calculate) as a better metric than "percentage similarity". May be off-topic, though. – tucuxi Jan 12 '13 at 13:34
  • It's a bit old but if people want fix to the params behaviour vote the bug to make it visible: https://bugs.php.net/bug.php?id=62648&thanks=6 – nax Oct 02 '13 at 10:46
  • Your link includes the `&thanks=6` which already shows the "thanks for voting message". Your link should be https://bugs.php.net/bug.php?id=62648 so you can actually vote – Hugo Delsing Oct 04 '13 at 08:05

4 Answers4

30

This was actually a very interesting question, thank you for giving me a puzzle that turned out to be very rewarding.

Let me start out by explaining how similar_text actually works.


Similar Text: The Algorithm

It's a recursion based divide and conquer algorithm. It works by first finding the longest common string between the two inputs and breaking the problem into subsets around that string.

The examples you have used in your question, actually all perform only one iteration of the algorithm. The only ones not using one iteration and the ones giving different results are from the php.net comments.

Here is a simple example to understand the main issue behind simple_text and hopefully give some insight into how it works.


Similar Text: The Flaw

eeeefaaaaafddddd
ddddgaaaaagbeeee

Iteration 1:
Max    = 5
String = aaaaa
Left : eeeef and ddddg
Right: fddddd and geeeee

I hope the flaw is already apparent. It will only check directly to the left and to the right of the longest matched string in both input strings. This example

$s1='eeeefaaaaafddddd';
$s2='ddddgaaaaagbeeee';

echo similar_text($s1, $s2).'|'.similar_text($s2, $s1);
// outputs 5|5, this is due to Iteration 2 of the algorithm
// it will fail to find a matching string in both left and right subsets

To be honest, I'm uncertain how this case should be treated. It can be seen that only 2 characters are different in the string. But both eeee and dddd are on opposite ends of the two strings, uncertain what NLP enthusiasts or other literary experts have to say about this specific situation.


Similar Text: Inconsistent results on argument swapping

The different results you were experiencing based on input order was due to the way the alogirthm actually behaves (as mentioned above). I'll give a final explination on what's going on.

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

On the first case, there's only one Iteration:

test
wert

Iteration 1:
Max    = 1
String = t
Left :  and wer
Right: est and 

We only have one iteration because empty/null strings return 0 on recursion. So this ends the algorithm and we have our result: 1

On the second case, however, we are faced with multiple Iterations:

wert
test

Iteration 1:
Max    = 1
String = e
Left : w and t
Right: rt and st

We already have a common string of length 1. The algorithm on the left subset will end in 0 matches, but on the right:

rt
st

Iteration 1:
Max    = 1
String = t
Left : r and s
Right:  and 

This will lead to our new and final result: 2

I thank you for this very informative question and the opportunity to dabble in C++ again.


Similar Text: JavaScript Edition

The short answer is: The javascript code is not implementing the correct algorithm

sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));

Obviously it should be first.substr(0,pos1)

Note: The JavaScript code has been fixed by eis in a previous commit. Thanks @eis

Demystified!

Community
  • 1
  • 1
Khez
  • 10,172
  • 2
  • 31
  • 51
  • Wow, excellently done! So we almost need to develop a dynamic programming solution? – Addo Solutions Jan 09 '13 at 01:27
  • 2
    Honestly, I don't think what we're experiencing is a bug. It's merely designed that way. Argument swapping is merely masking a design decision. Taking that into considering, it all depends on what you need and how you plan on working with the result. Do you need strings that sound the same? `soundex`. You need them to sound the same in english? `metaphone`. You want a general similarity function? `levenshtein`. Strangely enough, `similar_text` has a worse complexity, compared to `levenshtein` and possibly, unless fully aware, really really bad results. – Khez Jan 09 '13 at 01:48
  • @Khez I *think* the difference between javascript and PHP versions is actually a bug in the PHP implementation. Other than that, I think you are correct. – eis Jan 09 '13 at 10:21
  • @eis Tried juggling test/wert had the same results as in the PHP version. What difference did you actually find ? – Khez Jan 09 '13 at 10:33
  • @Khez um? test/wert has the same results in both versions, like I explained: it's the case OP raised with mysql-php string that has a difference. – eis Jan 09 '13 at 11:41
  • @eis It has a difference, if you have bugs in your algorithm! Updated my answer... – Khez Jan 09 '13 at 12:29
  • @Khez np. I guess I need to update my own answer as well. Good job finding the issue. – eis Sep 05 '14 at 07:18
27

It would indeed seem the function uses different logic depending of the parameter order. I think there are two things at play.

First, see this example:

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems to be that it is testing "how many times any distinct char on param1 is found in param2", and thus result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".

Now, the above is the same for both PHP and javascript implementations - paremeter order has an impact, so saying that JS code wouldn't do this is wrong. This is argued in the bug entry as intended behaviour.

Second - what doesn't seem correct is the MYSQL/PHP word example. With that, javascript version gives 3 irrelevant of the order of params, whereas PHP gives 2 and 3 (and due to that, percentage is equally different). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, irrelevant of which way you compare: H, I, S and T, one each, plus one for empty space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, correct answer should be 3 both ways. I modified the C code to a runnable version, and added some output, so one can see what is happening there (codepad link):

#include<stdio.h>

/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      printf("txt here %s,%s\n", txt1, txt2);
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */
int main(void)
{
    printf("Found %d similar chars\n",
        php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
    printf("Found %d similar chars\n",
        php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
    return 0;
}

the result is output:

txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here  GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

The reason for these results can be seen from the output: algorithm takes the first letter in the first string that second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in-between, and that's the thing causing the difference when you change the character order.

What happens there might be intentional or it might not. However, that's not how javascript version works. If you print out the same things in the javascript version, you get this:

txt here: PHP, WIT
txt here: P IS GREAT,  MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here:  GREAT, QL
Found 3 similar chars
txt here: WITH, PHP 
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars

showing that javascript version does it in a different way. What the javascript version does is that it finds 'H', ' ' and 'S' being in the same order in the first comparison, and the same 'H', ' ' and 'S' also on the second one - so in this case the order of params doesn't matter.

As the javascript is meant to duplicate the code of PHP function, it needs to behave identically, so I submitted bug report based on analysis of @Khez and the fix, which has been merged now.

eis
  • 51,991
  • 13
  • 150
  • 199
  • This sounds correct to me. It's also worth thinking about, "Why am I using similar_text(), instead of levenshtein() and / or metaphone()?" I've written string deduplication code, and string similar search code, using levenshtein() and metaphone() to much great results instead of similar_text(). A related technique worth looking into is TFIDF (total frequency vs. inverse domain frequency), when try to identify unique strings. – Apollo Clark Jan 08 '13 at 17:30
  • Regarding the **unconfirmed** bug, both the one in the ticket and the comment of the ticket on PHP.net, feels like that's the way the algorithm is meant to work... – Khez Jan 09 '13 at 01:52
  • @Khez yes, I agree. The actual confirmation what it should really do is still missing though so we won't know for sure. – eis Jan 09 '13 at 08:31
  • @eis Check my answer for an explanation of the underlying algorithm. I'm really uncertain what it was trying to achieve, seeing as how levenshtein is faster and has better results – Khez Jan 09 '13 at 09:09
  • @Khez I think you should note that OP actually has two algorithms here, one for PHP and one for javascript. I think the javascript version was the thing they tried to have, but failed with PHP implementation instead. :P – eis Jan 09 '13 at 10:27
  • @eis I disagree, OP clearly wanted to know how the PHP implementation worked. They used the JS implementation as a reference algorithm. Also note, my answer applies to the JS implementation also. Due to the algorithm's main "flaw"... – Khez Jan 09 '13 at 10:30
  • @Khez to quote, "But I still cant figure out why PHP returns a different result on turning the strings around. The JS code provided by dfsq doesn't do this." – eis Jan 09 '13 at 11:42
  • My main goal is to understand how the PHP version works. As I dont have the tools/knowledge to mess around with C code like you did, I was using the JS version to echo parts of the text. But since the results of the PHP and JS version are different, I couldnt get my awnser with just the JS code. – Hugo Delsing Jan 09 '13 at 14:14
  • @Khez gah, that was a simple one :) Darn. You are right in that it accounts for the difference. It looks however due to that, javascript version actually is closer to the thing that should be the "correct answer", if someone would ask me. Interesting. Good find! – eis Jan 09 '13 at 14:22
  • @HugoDelsing was my not post helpful? Do you still require futher explanation? – Khez Jan 09 '13 at 14:25
  • @HugoDelsing added codepad link to my answer - you don't actually need more tools than just your browser to mess around with the code. But ok, main thing was anyway to get this sorted. :) – eis Jan 09 '13 at 14:26
  • Thanks for the awnser. This is how I wanted to do it myself, but couldnt. Also thanks for the codepad link. – Hugo Delsing Jan 14 '13 at 10:18
  • @HugoDelsing It surprises me how many wrong the accepted answer is. The JavaScript implementation is different from the PHP implementation due to a typo. The PHP Implementation explained is not really accurate to how the algorithm works. The "bugs" found for the PHP implementation are nowhere near bugs... but sigh, it's accepted. Just Sigh. – Khez Jan 16 '13 at 15:52
  • @Khez you do note that the answer was revised before accepting, and it doesn't claim anymore that there are bugs in the PHP implementation. Javascript difference was noted also here - the answer argues that the behaviour is anyway up for speculation (be it typo or not). – eis Jan 16 '13 at 16:21
12
first String = aaaaaaaaaa = 10 letters
second String = aaaaa = 5 letters

first five letters are similar
a+a
a+a
a+a
a+a
a+a
a
a
a
a
a


( <similar_letters> * 200 ) / (<letter_count_first_string> + <letter_count_second_string>)

( 5 * 200 ) / (10 + 5);
= 66.6666666667
spinsch
  • 1,415
  • 11
  • 23
1

Description int similar_text ( string $first , string $second [, float &$percent ] )

This calculates the similarity between two strings as described in Oliver [1993]. Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string. Parameters

first

The first string.

second

The second string.

percent

By passing a reference as third argument, similar_text() will calculate the similarity in percent for you.
Kamesh S
  • 143
  • 2
  • 8