0

How can I search for an specific string with a margin of error?

Example:

I have a table with the following values:

Brands Table

  • Brand: Panasonic, Model: 15T
    • Brand: Apple, Model: IPHONE 7
    • Brand: Samsung, Model: Galaxy S8
    • Brand: Microsoft, M15

And I want to find a coincidence with a margin of 3 wrong characters. For the example my input is M$crosoft and I want it to return the Microsoft row. Or if I input Pnasonic it should input the Panasonic row.

How can I achieve this without sacrificing performance? The easy road is to compare each of the characters and a counter of the 3 errors, but I need performance since the brands table has around 200K+ rows.

P.S I'm coding in PHP.

  • 1
    Possible duplicate of [php (fuzzy) search matching](https://stackoverflow.com/questions/3208743/php-fuzzy-search-matching) – Obsidian Age Nov 10 '17 at 01:58
  • levenshtein, or metaphone. I prefer combining them. Metaphone you'll want to pre-compile them and save them in an index field in your DB, then search by using it on the input string, then levenshtein, can weed out "false hits" from the Sound index. – ArtisticPhoenix Nov 10 '17 at 02:01

1 Answers1

0

You'll probably want to use a combination of Metaphone, and Levenshtein. (for spelling errors )

http://php.net/manual/en/function.metaphone.php

And

http://php.net/manual/en/function.levenshtein.php

Metaphone works on sounds so a "poor" example is you can think of it as removing the vowels and changing some compound sounds to single letters ( almost like shorthand ). So using your example

$sound1 = metaphone('M$crosoft');

echo "$sound1\n";

$sound2 = metaphone('Microsoft');

echo "$sound2\n";

Outputs

MKRSFT
MKRSFT

As you can see they match.

You can test it here

http://sandbox.onlinephpfunctions.com/code/716471f5fed18268a2dc0aea800b3db634d9616f

Performance Because of the extra overhead of running metaphone, I would suggest pre-computing the Sound Index, of the words you search against before hand, and saving them in the database. Then when you run a user search you run the same metaphone function on their search words, and use that to search the sound index in the table. This way you are front loading the cost of building the sound index and only have to do it once ( or when the records are edited )

However, you may find the matching too loose, in which case you can use Levenshtein. This calculates the difference between 2 words based on the changes that are needed. Such as inserts updates and deletes that need to be made, you can even weight the operations.

$len = levenshtein ('M$crosoft', 'Microsoft');

echo "$len\n";

//as you can see the arguments are $str1, $str2, insert cost, replace cost, delete cost
//so we can control what weight we get for each operation.
$len = levenshtein ('M$crosoft', 'Microsoft', 1,2,1);

echo "$len\n";

Ouputs

1
2

Now if you need to combine that with a "Bunch" of text that can get to be real complicated, as you'll have to use Full text searching on the DB.

It's not trivial.

Possibly a better choice would be to look into using something like Sphinx, which is a full text search engine. It's not hard to get it setup basic like. But it won't be a magic bullet either so you'll have to do some things like stemming, and wordforms etc..

Again not trivial but it does have much better full text searching then the Mysql DB,

http://sphinxsearch.com/

Performance I can tell you it is fast, possibly 20x faster then MySql on text searching, but it comes with it's own quirks. But I highly recommend it. We run around 150k searches though our small sphinx cluster a minute, on a record set that is 1/4 million rows. ( our main server is a 12 core, 54GB monster though )

This type of searching there is no One sure fire solution, or at least I haven't found it yet ( and I have done a ton of it ).

ArtisticPhoenix
  • 21,464
  • 2
  • 24
  • 38