0

I am currently trying to figure out how to calculate the similarity between two records. My first record would be from a deactivated advertisement - so I want to find e.g. the 10 most similar advertisement regarding to some VARCHAR-fields equalness.

The thing, I can't figure out is, if there is any MySQL function, that can help me in any way - or if I need to compare the strings in some weird way?

EDIT #1

Similarity would be defined by these fields:

  • Title (weight: 50 %)

  • Content (weight: 40 %)

  • Category (weight: 10 %)

EDIT #2

I want the calculation to be like this:

Title: Words that match in the title field (only words >2 letters are matched).

Description: Words that match in the title field (only words >2 letters are matched).

Catgory: Match the category and if that doesn't match match the parent category with less weight :)

An equation of this could be:

#1 is the old, inactive post, #2 is the active post:

#2 title matches #1 title in 3 words out of #2's total of 10 words. That gives 30 % match = 30 points.

#2 description matches #1 description in 10 words out of #2's total of 400 words. That gives a 4 % match = 4 points.

#2 category doesn't match #1's category, therefore 0 % match. That gives 0 points.

Then the sum would be 34 points for #2. :)

Edit #3

Here's my query - but it doesn't return different rows, but a lot of the same row.

SELECT
            a.AdvertisementID as A_AdvertisementID,
            IF(a.Topic LIKE a2.Topic, 50, 0) + IF(a.Description LIKE a2.Description, 40, 0) + IF(a.Cate_CategoryID LIKE a2.Cate_CategoryID, 10, 0) as A_Score,
            a.AdvertisementID as A_AdvertisementID,
            a.Topic as A_Topic,
            LEFT(a.Description, 300) as A_Description,
            a.Price as A_Price,
            a.Type as A_Type
        FROM
            ".DB_PREFIX."A_Advertisements a2,
            ".DB_PREFIX."A_Advertisements a
        WHERE
            a2.AdvertisementID <> a.AdvertisementID
            AND
            a.AdvertisementID = :a_id
        ORDER BY
            A_Score DESC
denlau
  • 916
  • 2
  • 9
  • 21
  • 2
    How are you defining "similarity"? – Rowland Shaw Nov 02 '13 at 20:01
  • Hey Rowland Shaw - I have tried to define it in my latest edit. Hope this makes sense :) – denlau Nov 02 '13 at 20:04
  • http://www.artfulsoftware.com/infotree/queries.php#552 – Mihai Nov 02 '13 at 20:04
  • Are you looking for equality in those fields? Or do you have some idea of how to calculate a metric of "how similar" two strings are? – Rowland Shaw Nov 02 '13 at 20:06
  • Rowland: Nah, but I think, I want to match words. In the total advanced query, I would have the possibility to not give any value to 0-2 letter words, but match the rest. Then I want to take the sum of the matches and add the weight in percentages. I'll add a calculation above. – denlau Nov 02 '13 at 20:09

1 Answers1

2

If you can literally compare the fields you are interested in, you could have MySQL perform a simple scoring calculation using the IF() function, for example

select 
  foo.id,
  if (foo.title='wantedtitle', 50, 0) +
  if (foo.content='wantedcontent', 40, 0) +
  if (foo.category='wantedcategory', 10, 0) as score
from foo
order by score desc
limit 10

A basic 'find a fragment' could be achieved using like

select 
  foo.id,
  if (foo.title like '%wantedtitlefragment%', 50, 0) +
  if (foo.content like '%wantedcontentfragment%', 40, 0) +
  if (foo.category like '%wantedcategoryfragment%', 10, 0) as score
from foo
order by score desc
limit 10

There are other techniques, but they might be slow to implement in MySQL. For example, you could calculate the Levenstein distance between two string - see this post for an example implementation.

Community
  • 1
  • 1
Paul Dixon
  • 295,876
  • 54
  • 310
  • 348
  • Could I use LIKE instead of equal sign? :) – denlau Nov 02 '13 at 20:09
  • Yes, that would work just fine for a basic wildcard. – Paul Dixon Nov 02 '13 at 20:11
  • Okay, now my problem is, that I only know the OLD advertisements URL. How can I join so it will get the 10 with highest "score"? :) – denlau Nov 02 '13 at 21:12
  • Hey Paul, I've changed my question - can you help on edit #3? :) – denlau Nov 02 '13 at 22:08
  • Make sure you index the columns, especially when using '%...%' searches (in this case, only FULLTEXT indexes will be useful). Also, I don't think it's fair that you keep added more and more requests on top of your original question... that isn't how SO works. – Terry Nov 02 '13 at 22:10
  • Terry: Okay, that's my mistake. :) – denlau Nov 02 '13 at 22:21