There is something I am trying to accomplish although I'm not really sure where to start.

I currently have a MySQL database with a list of articles. The DB contains each article's title, content, and some other info such as dates.

There is an RSS feed that we monitor for new articles. It's a Google Alert feed that just contains the latest news on certain subjects. I want to automatically monitor this feed and record any feed items that are similar to stories currently in our DB.

I know how to set a script to run automatically, and I know how to parse the RSS feed with SimplePie.

What I need to figure out is how to take the description of each RSS feed item, check it against our DB to see whether it is similar to something we already have, and return a numerical score of some sort, like a "similarity rating".

After that I can have the info I need recorded to the DB if the "similarity rating" is above a set limit, which I know how to do.

So my only issue is how to compare each feed item to our current articles, and return a score based on how similar it is.

Sherwin Flight
  • As a reverse example, there's a classifieds website I use often. They prohibit posting more than one ad for the same item. I once tried to re-post my ad but forgot to delete my original one, and the site said it was too similar to another ad of mine. I tried rearranging the words a bit, and it still said the same thing. So it knew that my second ad was very similar to my original. I need to do whatever they are doing, but rather than blocking the very similar stories, I want those recorded. Just trying to clarify a bit what I'm talking about. – Sherwin Flight Apr 02 '12 at 04:13

1 Answer

The Levenshtein function (built into PHP as levenshtein(), and available for MySQL) is a good way to handle this. It calculates the minimum number of single-character edits (insertions, deletions, and substitutions) required to convert one string into another. That value, where a lower number means more similar strings, would be the basis of your "similarity rating".

EDIT: the Levenshtein function is not available natively in MySQL but there are SQL implementations of it that you can use such as: http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/
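
To illustrate the idea independently of PHP or MySQL, here is a minimal Python sketch of the edit distance and one possible way to turn it into a 0-to-1 "similarity rating". The names `similarity` and `best_match`, the normalization by the longer string's length, and the `articles` dictionary are assumptions made for illustration only; they are not taken from the question or from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical rating: 1.0 for identical strings, 0.0 for completely
    different ones (distance divided by the longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(description: str, articles: dict) -> tuple:
    """Return (article_id, score) for the stored text most similar to
    the feed item description. `articles` maps id -> text (assumed shape)."""
    return max(
        ((article_id, similarity(description, text))
         for article_id, text in articles.items()),
        key=lambda pair: pair[1],
    )
```

A feed item could then be recorded when the score returned by `best_match` is above your chosen limit. Keep in mind that a character-level edit distance is sensitive to word order, so two texts that use the same words in a different order may still score fairly low.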

TheOx
  • http://stackoverflow.com/questions/4671378/levenshtein-mysql-php covers this topic a little bit and might be of use. – TheOx Apr 02 '12 at 04:23
  • I'm going to mark this as an accepted answer, because the function you mentioned looks like it can help me with what I need. – Sherwin Flight Apr 02 '12 at 04:27
  • The MySQL function looks like my answer. I was hoping there would be a MySQL function so that I didn't have to loop through everything over and over with PHP. – Sherwin Flight Apr 02 '12 at 04:28