
Currently I have one monitoring application in production. Its job is to collect specific entries from social networks such as Facebook, Twitter, YouTube, and so on.

Here is a simple example of an API call to Twitter:

http://search.twitter.com/search?q=Stackoverflow&format=json

Basically, this is what the system does:

  1. Select the search term from the database in a specific order
  2. Call the API
  3. Collect all tweet status IDs and user IDs from the current search
  4. Check whether they already exist in the database
  5. Insert the tweets, skipping existing tweets and users to prevent duplicate-entry errors.
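Steps 4 and 5 can be collapsed into a single statement. A minimal sketch, assuming the tweets table has a unique key on the Twitter status ID (the column names here are hypothetical, not taken from the actual schema):

```sql
-- Assumes a UNIQUE/PRIMARY KEY on twitter_id, so duplicates are
-- silently skipped instead of raising a duplicate-entry error.
INSERT IGNORE INTO tweets (twitter_id, user_id, content, created_at)
VALUES (96456416957096909, 12345, 'Sample tweet text', NOW());
```

`INSERT IGNORE` suppresses the duplicate-key error; `INSERT ... ON DUPLICATE KEY UPDATE` is the alternative when the existing row should be updated instead.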

We ended up with two tables, one for users and another for tweets.
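For reference, a minimal sketch of what such a schema might look like (all column names and types here are assumptions, not taken from the question):

```sql
CREATE TABLE users (
  user_id     BIGINT UNSIGNED NOT NULL,  -- Twitter user ID
  screen_name VARCHAR(64)     NOT NULL,
  PRIMARY KEY (user_id)
);

CREATE TABLE tweets (
  tweet_id   BIGINT UNSIGNED NOT NULL,   -- Twitter status ID
  user_id    BIGINT UNSIGNED NOT NULL,
  content    VARCHAR(140)    NOT NULL,
  created_at DATETIME        NOT NULL,
  PRIMARY KEY (tweet_id),
  FOREIGN KEY (user_id) REFERENCES users (user_id)
);
```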

THE PROBLEM

After the MySQL database reached 200,000 rows in the tweets table (within the first months), the application that visualizes the data started consuming too many resources when running the select query against the existing tweets.

Why?

The system has separate accounts, each with certain search terms related to its specific business. When we perform a select, we need to return only the tweets associated with the terms of our account; we cannot see tweets that aren't related to us. But one tweet can belong to many accounts.

The current query (it hurts my eyes)

SELECT * 
   FROM tweets 
 WHERE 
   content LIKE '%searchterm1%' 
     OR content LIKE '%searchterm2%' 
     OR content LIKE '%searchterm3%' 
     OR content LIKE '%searchterm4%' 
     OR content LIKE '%searchterm5%' 
     OR content LIKE '%searchterm6%' 
     OR content LIKE '%searchterm7%' 
     OR content LIKE '%searchterm8%' 
     -- and so on...

The two possible solutions

a) Create a tweets_searches table with two foreign keys, tweet_id and search_id, so that each tweet in the tweets table can be related to a specific search term.

So instead of searching for specific strings, we would join these tables.
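A minimal sketch of option (a), assuming a searches table keyed by search_id with an account_id column tying terms to accounts (both assumptions, since the question doesn't show that table):

```sql
CREATE TABLE tweets_searches (
  tweet_id  BIGINT UNSIGNED NOT NULL,
  search_id INT UNSIGNED    NOT NULL,
  PRIMARY KEY (tweet_id, search_id),
  FOREIGN KEY (tweet_id)  REFERENCES tweets (tweet_id),
  FOREIGN KEY (search_id) REFERENCES searches (search_id)
);

-- Select only the tweets tied to the current account's terms.
-- DISTINCT collapses tweets that match several of the account's terms.
SELECT DISTINCT t.*
  FROM tweets t
  JOIN tweets_searches ts ON ts.tweet_id = t.tweet_id
  JOIN searches s         ON s.search_id = ts.search_id
 WHERE s.account_id = 42;
```

The composite primary key doubles as the dedup guard: inserting the same (tweet_id, search_id) pair twice fails (or is skipped with INSERT IGNORE).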

b) Keep searching the content, but with full-text searches using MATCH () AGAINST ().
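A minimal sketch of option (b). In the MySQL versions of that era, FULLTEXT indexes require a MyISAM table; the term list below is illustrative:

```sql
-- One-time setup: add a FULLTEXT index on the searched column (MyISAM table).
ALTER TABLE tweets ADD FULLTEXT INDEX ft_content (content);

-- Boolean mode matches any of the listed terms and can use the index,
-- unlike the leading-wildcard LIKE '%...%' predicates, which force a full scan.
SELECT *
  FROM tweets
 WHERE MATCH (content) AGAINST ('searchterm1 searchterm2 searchterm3' IN BOOLEAN MODE);
```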

THE MAIN QUESTION

Is that enough to reduce resource consumption (CPU and RAM)? Is there anything better I can do?

Keyne Viana
  • Why not use `match ... against`? – Nemoden Jul 22 '11 at 03:22
  • Or you could use `Sphinx` - would be even better. – Nemoden Jul 22 '11 at 03:22
  • A full-text search would be far more efficient –  Jul 22 '11 at 03:23
  • Are you comparing the db against all the results of the search from Twitter every single time it's run, or are you keeping track of the last pull time and only working with tweets since the last pull? – plague Jul 22 '11 at 03:34
  • Thank you @Nemoden, is it the best solution for MySQL? – Keyne Viana Jul 22 '11 at 03:35
  • @i_forget I'm performing the query each time I access it against the entire database. So I need a cache system right? Which is the best one? Currently I'm using Zend Framework (has Zend_Cache), would you suggest anything different? – Keyne Viana Jul 22 '11 at 03:38
  • @Dagon: The most efficient? Can it scale to at least 2 million records? Or is a cache system always necessary at that point? – Keyne Viana Jul 22 '11 at 03:39
  • it's one of the things you can try, benchmark, and see what works for your situation, no one here knows enough about your set up to say what's going to be the *best* option –  Jul 22 '11 at 03:47
  • I wouldn't say it is "the best solution". What does that even mean in our profession? You can use plenty of approaches. If anybody has "the best solution" I'd call him a liar. You can always tweak your code so your program consumes more memory and works faster, or uses less memory but takes longer to complete, or you can even optimize your code so it uses less memory *and* runs faster :-) That's what's interesting in optimization and one of the issues for high-load projects. Try `fulltext search`, try `Sphinx`; if that doesn't help, welcome back to `stackoverflow` ;) – Nemoden Jul 22 '11 at 04:01
  • Thanks @Nemoden you can post as answer if you want. – Keyne Viana Jul 22 '11 at 04:12
  • Yes. No problem. I'm glad if my answer is helpful for you, so I assume it might be helpful for other people, so I think I'd publish it as an answer. – Nemoden Jul 22 '11 at 04:23
  • Depending on the resources of your server, in my experience the dimension and size of your tweet table shouldn't be the problem (we're doing selects on tables with many millions of rows in production). 200,000 entries are nothing. I think your idea of a different database model would be a good solution. Doing a `like` search is horrible. For example (only some simple starting points): create a table which contains a `search_term` and a `tweet_id`; the search term field could contain a `soundex`ed value of the search term – rabudde Jul 22 '11 at 04:27
  • Hi @rabudde, seems interesting. Maybe something like this: http://stackoverflow.com/questions/4982317/mysql-full-text-search-error (separated MyISAM table)? Never heard about soundex; I'll take a look. Thank you. – Keyne Viana Jul 22 '11 at 04:53
  • Fulltext search is only available for MyISAM tables. But when you have highly concurrent INSERTs and SELECTs (and I think that's your case) you should choose InnoDB. So you'd have a separate 'search-term-to-tweet-id' table. With `SOUNDEX()` you get a somewhat more failure-tolerant search, and after getting the `tweet_id`s from your lookup table it would be absolutely fast to get the tweets from the tweet table – rabudde Jul 22 '11 at 05:05
  • Thank you all!! @rabudde Thanks for the mention of soundex. I'll give it a try. – Keyne Viana Jul 23 '11 at 05:15

1 Answer


Disclaimer: this is one of my comments on this question, which might be the answer:


I think `match ... against` is appropriate here. It is so-called "full-text search". For more complex searches, I'd use Sphinx - it indexes your database on its own (it has its own mechanism for it) and performs searches way faster than MySQL does.

Nemoden