How to find similarity between mySQL rows?

Question

I am trying to create a script that finds a matching percentage between my table rows. For example my mySQL database in the table products contains the field name (indexed, FULLTEXT) with values like

LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz 
LG TV 50PK350 PLASMA 50"
LG S24AW 24000 BTU
Aircondition LG S24AW 24000 BTU Inverter

As you may see all of them have some same keyword. But the 1st name and 2nd name are more similar. Additionally, 3rd and 4th have more similar keywords between them than 1st and 2nd.

My mySQL DB has thousands of product names. What I want is to find those names that have more than a percentage (let's say 60%) of similarity.

For example, as I said, 1st, 2nd (and any other name) that match between them with more than 60%, will be echoed in a group-style-format to let me know that those products are similar. 3rd and 4th and any other with more than 60% matching will be echoed after in another group, telling me that those products match.

If it is possible, it would be great to echo the keywords that satisfy all the grouped matching names. For example LG S24AW 24000 BTU is the keyword that is contained in 3rd and 4th name.

At the end I will create a list of all those keywords.

What I have now is the following query (as Jitamaro suggested)

Select t1.name, t2.name From products t1, products t2

that creates a new name field next to all other names. Excuse me that I don't know how to explain it right but this is what it does: (The real values are product names like above)

Before the query

-name-
A
B
C
D
E

After the query

-name- -name-
A        A
B        A
C        A
D        A
E        A
A        B
B        B
C        B
D        B
E        B
.
.
.

Is there a way either with mySQL or PHP that will find me the matching names and extract the keywords as I described above? Please share code examples.

Thank you community.

I don't think you will get the code here. You will get an algorithm. It's not us making your homework. — Micromega, Aug 08 '11 at 10:10
upto how many different shops you have selling same product and do all (or most) of the product names contain model number like `50PK350`? — Imre L, Aug 08 '11 at 22:21
There is no limit on how many shops are selling the same product (any product). A product has 99% to contain same terms with the same product from a different shop — EnexoOnoma, Aug 09 '11 at 00:22
Maybe using bigrams or trigrams? ... http://bit.ly/o5Iw8o and on Wikipedia ... http://bit.ly/ptpUJ4 — dgnorton, Aug 12 '11 at 17:38

Pedro Lobito · Answer 1 · 2016-12-18T15:20:34.760

6

Query the DB with LIKE OR REGEXP:

SELECT * FROM product WHERE product_name LIKE '%LG%';
SELECT * FROM product WHERE product_name REGEXP "LG";

Loop the results and use similar_text():

$a = "LG 50PK350 PLASMA TV 50\" Plasma TV Full HD 600Hz"; // DB value
$b = "LG TV 50PK350 PLASMA 50\"" ; // USER QUERY

$i = similar_text($a, $b, $p);
echo("Matched: $i  Percentage: $p%");

//outputs: Matched: 21 Percentage: 58.3333333333%

Your second example matches 62.0689655172%:

$a = "LG S24AW 24000 BTU"; // DB value
$b = "Aircondition LG S24AW 24000 BTU Inverter" ; // USER QUERY

$i = similar_text($a, $b, $p);
echo("Matched: $i  Percentage: $p%");

You can define a percentage higher than, lets say, 40%, to match products.
Please note that similar_text() is case SensItivE so you should lower case the string.

edited Dec 18 '16 at 15:20

answered Aug 08 '11 at 21:52

Pedro Lobito

94,083
31
258
268

Hi there, ok with the similar text function but the main problem is how to check automatically all names like each name against another. Because the db will be huge, there should be no manual inputs. – EnexoOnoma Aug 09 '11 at 00:28
1

hi nicolay, I'm not sure if I understood "check automatically all names like each name against another", can you make that more clear ? – Pedro Lobito Aug 09 '11 at 16:20
Can you please take a look at my updated question for all the needed details ? – EnexoOnoma Aug 12 '11 at 09:00
Expensive table-scan for every product viewed? – Jeff Ferland Aug 12 '11 at 11:12

score 4 · Answer 2 · answered Aug 03 '11 at 14:10

4

As for your second question, the levenshtein() function (in MySQL) would be a good candidate.

answered Aug 03 '11 at 14:10

Alix Axel

151,645
95
393
500

score 2 · Answer 3 · edited May 23 '17 at 12:08

When I look at your examples, I consider how I would try to find similar products based on the title. From your two examples, I can see one thing in each line that stands out above anything else: the model numbers. 50PK350 probably doesn't show up anywhere other than as related to this one model.

Now, MySQL itself isn't designed to deal with questions like this, but some bolt-on tools above it are. Part of the problem is that querying across all those fields in all positions is expensive. You really want to split it up a certain way and index that. The similarity class of Lucene will grant a high score to words that rarely appear across all data, but do appear as a high percentage of your data. See High level explanation of Similarity Class for Lucene?

You should also look at Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

Scoring each word against the Lucene similarity class ought to be faster and more reliable. The sum of your scores should give you the most related products. For the TV, I'd expect to see exact matches first, then some others of the same size, then brand, then TVs in general, etc.

Whatever you do, realize that unless you alter the data structures by using another tool on top of the SQL system to create better data structures, your queries will be too slow and expensive. I think Lucene is probably the way to go. Sphinx or other options not mentioned may also be up for consideration.

I think the Lucene approach (or something similar) would work quite well. I've implemented almost the same for a general product search function, though it didn't support auto-suggest, it would be pretty easy to implement, as Lucene already returns the 'distinct' products for a given search query. So you only have to setup a periodic script, that updates your lucene index with new/changed/removed products (I had to keep the Oracle database, because it was not under my control...). — Simon Lehmann, Aug 11 '11 at 14:35

score 1 · Answer 4 · answered Aug 06 '11 at 23:38

This is trickier than it seems and there is information missing in your post:

How are people going to use this auto-complete function?
Is it relevant that you can find all names for a product? Because apparently not all stores name their products similarly so a clerk might not be able to find the product (s)he found.
Do you have information about which product names are for the same product?
Is it relevant from which store you're searching? where is this auto-complete used?
Should the auto-complete really only suggest products that match all the words you typed? (it's not so hard, technically, to correct typos)

I think you need a more clear picture of what you (or better yet: the users) want this auto-complete function to do.

An auto-complete function is very much a user-friendly type feature. It aids the user, possibly in a fuzzy way so there is no single right answer. You have to figure out what works best, not what is easiest to do technically.

First figure out what you want, then worry about technology.

Hi there thanks for your reply, however i think that my question was clear enough of what i am looking for. The application will work best for me as i described. Any autocomplete issues are solved and different shops have similar products names something that i checked. ( i am not the downvoter). — EnexoOnoma, Aug 07 '11 at 18:48
I agree with Frits especially for third point. I dont know why Nikolai maintaining different rows for same product and he is not mentioned any logical db relation between them.I think that is why db is getting huge. — kiranking, Aug 10 '11 at 20:48

J0HN · Answer 5 · 2011-08-09T06:30:06.830

One possible solution is to use Damerau-Levenstein distance. It could be used like this

select *
from products p
where DamerauLevenstein(p.name, '*user input here*')<=*X*

You'll have to figure out X that suites your needs best. It should be integer greater than zero. You could have it hard-coded, parameterized or calculated as needed.

The trickiest thing here is DamerauLevenstein. It has to be stored procedure, that implements Damerau-Levenstein algorithm. I don't have MySQL here, so I might write it for you later this day.

Update: MySQL does not support arrays in stored procedures, so there is no way to implement Damerau-Levenstein in MySQL, except using temporary table for each function call. And that will result in terrible performance. So you have two options: loop through the results in PHP with levenstein like Alix Axel suggests, or migrate your database to PostgreSQL, where arrays are supported. There is also an option to create User-Defined function, but this requires writing this function in C, linking it to MySQL and possibly rebuilding MySQL, so this way you'll just add more headache.

Thank you for this, I have to say that this "list creation" will be created every time I will update the products, this means not very often, so I guess the performance is not a problem. I need to stay with mySQL or/and PHP though. — EnexoOnoma, Aug 09 '11 at 10:52
Well, I have to admitthat MySQL is not my favourite DB, so that way of using varbinary as array is new fo me. — J0HN, Aug 12 '11 at 04:00

score 0 · Answer 6 · edited May 23 '17 at 12:08

0

This question is similar :) to this one:

What is the best way to implement a substring search in SQL?

Trigram can easily find similar rows, and in that question i posted a php+mysql+trigram solution.

edited May 23 '17 at 12:08

Community

1
1

answered Feb 06 '13 at 22:11

steve

615
6
14

score 0 · Answer 7 · answered Aug 03 '11 at 13:34

Your approach seems sound. For matching similar products, I would suggest a trigram search. There's a pretty decent explanation of how this works along with the String::Trigram Perl module.

I would suggest using trigram search to get a list of matches, perhaps coupled with some manual review depending on how much data you have to deal with and how frequent you need to add new products. I've found this approach to work quite well in practice.

Micromega · Answer 8 · 2011-08-08T01:36:14.227

0

Maybe you want to find the longest common substring from the 2 strings? Then you need to compute a suffix tree for each of your strings see here http://en.wikipedia.org/wiki/Longest_common_substring_problem.

edited Aug 08 '11 at 01:36

answered Aug 07 '11 at 00:10

Micromega

12,486
7
35
72

Micromega · Answer 9 · 2011-08-11T15:57:13.297

0

If you want to check all names against each other you need a cross join in mysql. There are many ways to achieve this:

1. Select a, b From t1, t2

2. Select a, b From t1 Join t2

3. Select a, b From t1 Cross Join t2

Then you can loop through the result. This is the same when I say create a 2d array with n^2-(n-1) elements and each element is connected with each other.

P.S.: Select t1.name, t2.name From products t1, products t2

edited Aug 11 '11 at 15:57

answered Aug 09 '11 at 01:48

Micromega

12,486
7
35
72

Can you please tell me what sql query to use exactly ? My table is name products and the field that holds the product names is name. – EnexoOnoma Aug 11 '11 at 15:37
Select t1.name, t2.name From products t1, products t2 – Micromega Aug 11 '11 at 15:57
Aham I now get the point of this. But how can I have a matching percentage of t1, t2 ? or get the results into a php script so that I use some matching function ? – EnexoOnoma Aug 11 '11 at 16:24
This is a strange question. You can try for example a suffix tree? Like I have recommend. – Micromega Aug 11 '11 at 17:40
Sorry but I dont have the knowledge on how to make it working with the suffix tree – EnexoOnoma Aug 11 '11 at 18:58
It's a data structure and it's the same like a ternary tree or a simple trie. The idea is to fill a dictionary with all the suffixes from a word or sentence. A suffix tree is a very clever way to fill this data structure. You don't really need it. – Micromega Aug 11 '11 at 19:29

Jesse Seger · Answer 10 · 2011-08-12T11:07:42.027

0

It seems that you might always want to return the shortest string?? That's more or a question than anything. But then you might have something like...

SELECT * FROM products LIMIT 1
WHERE product_name like '%LG%'
ORDER BY LENGTH(product_name) ASC

edited Aug 12 '11 at 11:07

answered Aug 10 '11 at 17:47

Jesse Seger

951
1
13
31

No, I want the string that satisfies all similar products. This may have manual work but what I want for this is to find whether a name is similar to others. There will be no manual inputs, each name will be checked against all others. – EnexoOnoma Aug 10 '11 at 21:14
Oh ok. But, I'm not sure what you mean by "no manual inputs" or "each name will be checked against all others." – Jesse Seger Aug 11 '11 at 12:07
In product_name I have those values A, B, C, D, E. A must be checked against B, C, D, E. B will be checked against A, C, D, E. C will be checked against A, B, D, E and so on. If there is a option we can ignore B against A if there was earlier an A against B. – EnexoOnoma Aug 11 '11 at 12:55
Oh.. Well that is easy then. Now all you have to do now is cleary identify what "checked against" looks like. – Jesse Seger Aug 11 '11 at 13:11
Nikolai: I have given you a simple solution in my answer using a cross join. Did you consider this? – Micromega Aug 11 '11 at 14:15
@Jesse Seger As I said in my question, I want to find how similar is the A to B, to C to D and so on. I have all the details in my question. – EnexoOnoma Aug 11 '11 at 15:39
@Jitamaro I have replied on your answer. – EnexoOnoma Aug 11 '11 at 15:39
Well, I have to say that @Tuga probably has the closest thing so far. – Jesse Seger Aug 12 '11 at 11:12

score 0 · Answer 11 · answered Aug 11 '11 at 17:37

0

It sounds like you've gone through all this trouble to explain a complex scenario, then said that you want to ignore the optimal answers and just get us to give you the "handshake" protocol (everything is compared to everything that hasn't been compared to it yet). So... pseudocode:

select * from table order by id
while (result) {
    select * from table where id > result_id
}

That will do it.

answered Aug 11 '11 at 17:37

Jeff Ferland

17,832
7
46
76

Isn't this the same like Select a, b From t1, t2? – Micromega Aug 11 '11 at 17:55
What you typed gives N^2 results. What I typed is N^2/2 results. Put another way, that keeps any row on the left from being paired up with a row that it was already on the right of. – Jeff Ferland Aug 11 '11 at 18:16
I dont have auto increment but i will add if it is necessary. Yes I use mySQL. For your concern: My table is named products and the field that holds the product names is name – EnexoOnoma Aug 11 '11 at 18:56
This works on any database where you use a column that is unique and indexed. That can be an auto_incremented id or a varchar of some kind or just about any other field type. – Jeff Ferland Aug 11 '11 at 19:05
I mean N^2-(n-1) but this is useful for me. The last row gets compared with the last row. – Micromega Aug 11 '11 at 19:20
No, the last row doesn't get compared to itself. x is not greater than x. – Jeff Ferland Aug 11 '11 at 21:05
Can you upload an image of this? When I have a graph isn't it half the graph in diagonale? – Micromega Aug 11 '11 at 21:24
http://imgur.com/IfeaA Also, n*(n-1)/2 if we want to be exact about the equation. – Jeff Ferland Aug 11 '11 at 21:31

score 0 · Answer 12 · answered Aug 12 '11 at 01:47

If your database simply had a UPC code as one of it's fields, and this field was well-maintained, i.e., you could trust that it was entered correctly by the database maintainer and correctly reflected what the item was -- then you wouldn't need to do all of the work you suggest.

An even better idea might be to have a UPC field in your next database -- and constrain it as unique.

Database users attempt to put an-already-existing UPC into the database -- they get an error.

Database maintains its integrity.

And if such a database maintained its integrity -- the necessity of doing what you suggest never arises.

This probably doesn't help much with your current task (apologies) -- but for a future similar database -- you might wish to think about it...

Tae-Sung Shin · Answer 13 · 2011-08-13T16:07:41.673

0

This is a clustering problem, which can be resolved by a data mining method. ( http://en.wikipedia.org/wiki/Cluster_analysis) It requires a lot of memory and computation intensive operations which is not suitable for database engine. Otherwise, separate data mining, text mining, or business analytics software wouldn't have existed.

edited Aug 13 '11 at 16:07

answered Aug 12 '11 at 05:32

Tae-Sung Shin

20,215
33
138
240

score 0 · Answer 14 · answered Aug 12 '11 at 05:49

0

I`d advise you to use some fulltext search engine, like sphinx. It has possibilities to implement any algorithm you want. For example, you may use "quorom" or "any" searches.

answered Aug 12 '11 at 05:49

Timur

6,668
1
28
37

score -1 · Answer 15 · answered Aug 03 '11 at 13:15

-1

You can use LIKE to find similar product names within the table. For example:

SELECT * FROM product WHERE product_name LIKE 'LG%';

answered Aug 03 '11 at 13:15

Justin Ethier

131,333
52
229
284

1

This is for sure a way that will help but what if you don't have a sample list and the DB has thousands of entries? I think a more "automatic" solution is needed – EnexoOnoma Aug 03 '11 at 14:44

score -1 · Answer 16 · answered Aug 08 '11 at 00:53

-1

Here is another idea (but I'm voting for levenshtein()):

Create a temporary table of all words used in names and their frequencies.

Choose range of results (most popular words are probably words like LCD or LED, most unique words could be good, they might be product actual names).

Suggest for each of result words either:

results with those words
results containing longest substring (like this: http://forums.mysql.com/read.php?10,277997,278020#msg-278020 ) of those words.

answered Aug 08 '11 at 00:53

Chris Hasiński

2,965
2
25
34

Why would this better then just the longest substring like you mentioned or like I suggest? What is this range of results? Is this an algorithm? Counting frequenices and build a tree is cs. – Micromega Aug 08 '11 at 01:06
The idea was to assign all products to groups according to the list least frequent words. You can use longest substring of those words to compensate for minor differences like suffixes). Range of results for word frequencies are just results with frequency smaller than a threshold you select. Basically it's simplified clustering using custom distance function. – Chris Hasiński Aug 08 '11 at 01:23

score -1 · Answer 17 · answered Aug 11 '11 at 11:52

Ok, I think I was trying to implement very much similar thing. It can work the same as the google chrome address box. When you type the address it gives you the suggestions. This is what you are trying to achieve as far I am concerned.

I cannot give you exact solution to that but some advice.

You need to implement the dropdown box where someone starts to enter the product they are looking for
Then you need to get the current value of the dropdown and then run query like guy posted above. Can be "SELECT * FROM product WHERE product_name LIKE 'LG%';"
Save results of the query
Refresh the page
Add the results of the query to the dropdown

Note:

You need to save the query results somewhere like the text file with the HTML code i.e. "option" LG TS 600"/option" (add <> brackets to option of course). This values will be used for populating your option box after the page refresh. You need to set up the users session for the user to get the same results for the same user, otherwise if more users would use the search at the same time it could clash. So, with the search id and session id you can match them then. You can save it in the file or the table. Table would be more convenient. It is actually in my sense the whole subsystem for that what are you looking for.

I hope it helps.

Hello thank you for your info it may help me. I have already the autocomplete script working as I wrote. I think that my question is clear enough of what I am exactly looking for though. — EnexoOnoma, Aug 11 '11 at 12:53

How to find similarity between mySQL rows?

17 Answers17