
Levenshtein is an algorithm for computing the edit distance between two strings. string_similarity works in a similar way: it calculates the similarity and outputs a score.

This question has a couple of moving parts.

Take two strings:

$string1="adipesesing et";
$string2="Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua";

Looking at $string1, "adipesesing et" is most similar to the substring "adipisicing elit" within $string2, where the first word is misspelled and the second is abbreviated. The above functions would not calculate the score based on this substring, but on all of $string2.
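
To make the problem concrete, here is a minimal sketch using PHP's built-in levenshtein on the two strings above. The variable names $whole and $part are only for illustration. It compares the distance against all of $string2 with the distance against the substring picked out by eye:

```php
<?php
$string1 = "adipesesing et";
$string2 = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua";

// Distance to the whole of $string2: dominated by the difference in length.
$whole = levenshtein($string1, $string2);

// Distance to the substring a human reader would pick out.
$part = levenshtein($string1, "adipisicing elit");

printf("whole string: %d, hand-picked substring: %d\n", $whole, $part);
```

The whole-string distance can never be smaller than the difference in length between the two strings, so it says very little about how well the best-matching part of $string2 fits.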

Is there a popular method for finding the most similar substring using these two functions?

alrightgame
    https://stackoverflow.com/questions/16520646/how-to-check-a-partial-similarity-of-two-strings-in-php may or may not be helpful –  Feb 26 '19 at 21:48

1 Answer


It may not be ideal, but one possibility is to split both strings into words and then apply similar_text to each run of consecutive words from $string2 that has the same word count as $string1.

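// Split on runs of non-word characters, dropping any empty pieces.
$needle_word_count = count(preg_split('/\W+/', $string1, -1, PREG_SPLIT_NO_EMPTY));

$haystack_words = preg_split('/\W+/', $string2, -1, PREG_SPLIT_NO_EMPTY);

$results = [];
$n = count($haystack_words) - $needle_word_count;
for ($i = 0; $i <= $n; $i++) {
    // Window of consecutive words with the same word count as $string1.
    $words = array_slice($haystack_words, $i, $needle_word_count);
    $substring = implode(' ', $words);
    // similar_text returns the number of matching characters.
    $results[$substring] = similar_text($substring, $string1);
}
// Sort by score, highest first, keeping the substring keys.
arsort($results);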

/* $results:

array (size=17)
  'adipisicing elit' => int 11
  'consectetur adipisicing' => int 8
  'sit amet' => int 5
  'labore et' => int 5
  'eiusmod tempor' => int 4
  'dolor sit' => int 4
  'ipsum dolor' => int 4
  'tempor incididunt' => int 4
  'Lorem ipsum' => int 3
  'incididunt ut' => int 3
  'sed do' => int 3
  'do eiusmod' => int 3
  'elit sed' => int 3
  'amet consectetur' => int 3
  'dolore magna' => int 3
  'et dolore' => int 2
  'ut labore' => int 1
*/
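
Since the question also mentions Levenshtein, the same sliding-window idea can be sketched with levenshtein instead, where a lower score means a closer match. This is only an illustration under the same assumption as above (the match has the same number of words as $string1); closest_word_window is a hypothetical helper name, not a built-in function.

```php
<?php
/**
 * Hypothetical helper: returns the run of consecutive words in $haystack
 * with the smallest Levenshtein distance to $needle.
 */
function closest_word_window(string $needle, string $haystack): array
{
    $needle_words   = preg_split('/\W+/', $needle, -1, PREG_SPLIT_NO_EMPTY);
    $haystack_words = preg_split('/\W+/', $haystack, -1, PREG_SPLIT_NO_EMPTY);
    $size = count($needle_words);

    $best = ['substring' => null, 'distance' => PHP_INT_MAX];
    if ($size === 0) {
        return $best; // nothing to compare against
    }

    // Slide a window of $size words across the haystack.
    for ($i = 0; $i + $size <= count($haystack_words); $i++) {
        $candidate = implode(' ', array_slice($haystack_words, $i, $size));
        $distance  = levenshtein($needle, $candidate);
        // Keep the first window with the strictly smallest distance.
        if ($distance < $best['distance']) {
            $best = ['substring' => $candidate, 'distance' => $distance];
        }
    }
    return $best;
}

$string1 = "adipesesing et";
$string2 = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua";

$best = closest_word_window($string1, $string2);
echo $best['substring'], ' (distance ', $best['distance'], ")\n";
```

Unlike the similar_text loop, this keeps only the single best window rather than a sorted list, and it does not handle matches whose word count differs from that of $string1.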
Don't Panic