5

I see an ever increasing number of users signing up on my site to just send duplicate SPAM messages to other users. I've added some server side code to detect duplicate messages with the following mysql query:

  SELECT count(content) as msgs_sent 
    FROM messages 
   WHERE sender_id = '.$sender_id.' 
GROUP BY content having count(content) > 10

The query works well but now they're getting around this by changing a few charctersr in their messages. Is there a way to detect this with MySQL or do I need to look at each grouping returned from MySQL and then use PHP to determine the percentage of similarity?

Any thoughts or suggestions?

John Woo
  • 258,903
  • 69
  • 498
  • 492
Paul
  • 11,671
  • 32
  • 91
  • 143
  • 1
    Not an answer: MySQL fulltext search indexes can detect "similar" text. So can numerous other search technologies. – Charles Feb 15 '12 at 02:21
  • MySQL's [`FULLTEXT`](http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) search might help with it's [boolean mode](http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html). – Bojangles Feb 15 '12 at 02:22
  • http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex and http://stackoverflow.com/questions/5322917/how-to-compute-similarity-between-two-strings-in-mysql – Cheery Feb 15 '12 at 02:23
  • hope your using a CAPTCHA, and email varification –  Feb 15 '12 at 02:30
  • What other data do you have access to? Timestamps? Are they getting users by they're ID or in ABC order? You might be able to find other patterns... – tplaner Feb 15 '12 at 02:36
  • They're getting users by IDs. Yes I use CAPTCHA, but not email verification. – Paul Feb 15 '12 at 03:16

1 Answers1

4

Fulltext Match

You could look at implementing something similar to the MATCH example here:

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

So for your example, perhaps:

SELECT id, MATCH (content) AGAINST ('your string') AS score
FROM messages 
WHERE MATCH (content) AGAINST ('your string')
    AND score > 1;

Note that to use these functions your content column would need to be a FULLTEXT index.

What is score in this example?

It is a relevance value. It is computed through the process described below:

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row.

From the documentation page.

Michael Robinson
  • 29,278
  • 12
  • 104
  • 130
  • 1
    How is the score value relevant? Can I use this to determine a percentage of how similar messages are? Thanks – Paul Feb 15 '12 at 03:11
  • Trying it now but I receive an error: #1210 - Incorrect arguments to AGAINST My query: SELECT id, MATCH (content) AGAINST (content) AS score FROM messages WHERE MATCH (content) AGAINST (content) AND score > 1; – Paul Feb 15 '12 at 14:11
  • The `AGAINST()` argument [must be a constant string](http://dev.mysql.com/doc/refman/5.5/en/fulltext-restrictions.html). – Marcus Adams Feb 15 '12 at 15:00