
I have an SQL query that randomly selects 1200 of the top retweeted tweets (retweeted at least 50 times, with a tweetDate more than 4 days old) from 40 million records. The query pasted below works, but it takes 40 minutes. Is there a faster version of this query?

SELECT 
    originalTweetId, COUNT(*) AS total, tweetContent, tweetDate
FROM
    twitter_gokhan2.tweetentities
WHERE
    originalTweetId IS NOT NULL
        AND originalTweetId <> -1
        AND isRetweet = true
        AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
GROUP BY originalTweetId
HAVING total > 50
ORDER BY RAND()
LIMIT 0, 1200;


---------------------------------------------------------------

Table creation sql is like:

    CREATE TABLE `tweetentities` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `tweetId` bigint(20) NOT NULL,
      `tweetContent` varchar(360) DEFAULT NULL,
      `tweetDate` datetime DEFAULT NULL,
      `userId` bigint(20) DEFAULT NULL,
      `userName` varchar(100) DEFAULT NULL,
      `retweetCount` int(11) DEFAULT '0',
      `keyword` varchar(500) DEFAULT NULL,
      `isRetweet` bit(1) DEFAULT b'0',
      `isCompleted` bit(1) DEFAULT b'0',
      `applicationId` int(11) DEFAULT NULL,
      `latitudeData` double DEFAULT NULL,
      `longitude` double DEFAULT NULL,
      `originalTweetId` bigint(20) DEFAULT NULL,
     PRIMARY KEY (`id`),
  KEY `index` (`originalTweetId`),
  KEY `index3` (`applicationId`),
  KEY `index2` (`tweetId`),
  KEY `index4` (`userId`),
  KEY `index5` (`userName`),
  KEY `index6` (`isRetweet`),
  KEY `index7` (`tweetDate`),
  KEY `index8` (`originalTweetId`),
  KEY `index9` (`isCompleted`),
  KEY `index10` (`tweetContent`(191))
) ENGINE=InnoDB AUTO_INCREMENT=41501628 DEFAULT CHARSET=utf8mb4$$
mgokhanbakal

4 Answers

---------------------------------------------------------------

You are, of course, summarizing a huge number of records, then randomizing them. This kind of thing is hard to make fast. Going back to the beginning of time makes it worse, and filtering on a null condition trashes it.

If you want this to perform reasonably, you must get rid of the IS NOT NULL test. It is redundant anyway: a NULL originalTweetId already fails the <> -1 comparison, because NULL never compares true against anything.
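You can verify directly that the NULL check adds nothing, because a comparison against NULL yields NULL, which WHERE treats as not-true:

```sql
SELECT NULL <> -1;   -- yields NULL, so a row with NULL originalTweetId
                     -- is already filtered out by the <> -1 condition
```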

But let us try to find a reasonable solution. First, let's get the originalTweetId values we need.

 SELECT MIN(id) originalId,
        MIN(tweetDate) tweetDate,
        originalTweetId, 
        Count(*) as total
   FROM twitter_gokhan2.tweetentities
  WHERE originalTweetId <> -1 
  /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
    AND isRetweet = true
    AND tweetDate < CURDATE() - INTERVAL 4 DAY
    AND tweetDate > CURDATE() - INTERVAL 30 DAY  /*let's add this, if we can*/
  GROUP BY originalTweetId
 HAVING total >= 50

This summary query gives us the lowest id number and date in your database for each subject tweet.

To get this to run fast, we need a compound index on (originalTweetId, isRetweet, tweetDate, id). The query will do a range scan of this index on tweetDate, which is about as fast as you can hope for. Debug this query, both for correctness and performance, then move on.
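The compound index described above could be added like this (the name idx_retweet_summary is illustrative; note that InnoDB secondary indexes carry the primary key implicitly, so listing id explicitly is optional):

```sql
ALTER TABLE twitter_gokhan2.tweetentities
  ADD INDEX idx_retweet_summary (originalTweetId, isRetweet, tweetDate, id);
```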

Now do the randomization. Let's do this with the minimum amount of data we can, to avoid sorting some enormous amount of stuff.

 SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
   FROM (
    SELECT MIN(id) originalId,
           MIN(tweetDate) tweetDate,
           originalTweetId, 
           COUNT(*) AS total
      FROM twitter_gokhan2.tweetentities
     WHERE originalTweetId <> -1 
     /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
       AND isRetweet = true
       AND tweetDate < CURDATE() - INTERVAL 4 DAY
       AND tweetDate > CURDATE() - INTERVAL 30 DAY  /*let's add this, if we can*/
      GROUP BY originalTweetId
     HAVING total >= 50
   ) AS retweets
  ORDER BY randomOrder
  LIMIT 1200

Great. Now we have a list of 1200 tweet ids and dates in random order. Now let's go get the content.

 SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
   FROM (
          /* that whole query above */
        ) AS a
   JOIN twitter_gokhan2.tweetentities AS b ON (a.originalId = b.id)
  ORDER BY a.randomOrder

See how this goes? Use a compound index to do your summary, and do it on the minimum amount of data. Then do the randomizing, then go fetch the extra data you need.
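To confirm MySQL actually uses the compound index for the summary step, check the plan (the index name idx_retweet_summary is an assumption from the earlier suggestion):

```sql
EXPLAIN
SELECT MIN(id), MIN(tweetDate), originalTweetId, COUNT(*) AS total
  FROM twitter_gokhan2.tweetentities
 WHERE originalTweetId <> -1
   AND isRetweet = true
   AND tweetDate < CURDATE() - INTERVAL 4 DAY
 GROUP BY originalTweetId
HAVING total >= 50;
-- look for key = idx_retweet_summary and "Using index" in the Extra column
```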

O. Jones
  • Does moving `RAND()` into the column location of the same query really avoid a full table scan? (typical `ORDER BY RAND()` would cause a table scan and possibly temporary table, see: http://stackoverflow.com/questions/4329396/mysql-select-10-random-rows-from-600k-rows-fast. Intuitively it seems putting it in sub-query might help, since it's already grouped at that level, but it seems like column vs order by location shouldn't matter.) –  Nov 19 '13 at 15:08
  • I put RAND() into the subquery to ensure that the outer query's order is preserved. I'm a bit of a pedant about remembering that the order of query results is unpredictable in the absence of ORDER BY clauses. – O. Jones Nov 19 '13 at 15:11
  • Functionally the location (which query) seems fine. I'm asking about why you changed it from `ORDER BY RAND()` to `RAND() AS foo` then `ORDER BY foo`. I was wondering if that had any performance impact at all. –  Nov 19 '13 at 15:13
  • I haven't tested it. The query ending in `LIMIT 1200` still has to consider every row for the row-sampling to be properly random. So there's no huge performance increase here. I just did it that way to make the ordering stable. – O. Jones Nov 19 '13 at 15:17
  • And yes, there is no way to escape the use of a temporary table here. We are manipulating the resultset of a summary subquery. – O. Jones Nov 19 '13 at 15:21
  • True, but the hope is we're only manipulating at most 40 million / 50 rows instead of 40 million rows... (I believe the OP query would have been operating on all 40 million, though query optimization is always advancing) huh, I just realized... for rand() it'd probably be faster to pick 1200 random timestamps and ... yep gonna have to try it. –  Nov 19 '13 at 15:27
---------------------------------------------------------------

You're selecting a huge number of records by selecting every record older than 4 days.

Since the query takes so long, why not prepare the results ahead of time with an independent script that runs repeatedly in the background?

You might also be able to assume that if a record is a retweet, its originalTweetId cannot be null/-1.
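One way to sketch the background-preparation idea using MySQL's event scheduler (table and event names are illustrative, and event_scheduler must be ON; a cron job running the same statement would work equally well):

```sql
-- Small cache table the website queries instead of the 40M-row table.
CREATE TABLE IF NOT EXISTS top_retweets_cache (
  originalTweetId BIGINT NOT NULL,
  total           INT NOT NULL,
  tweetContent    VARCHAR(360),
  tweetDate       DATETIME,
  PRIMARY KEY (originalTweetId)
);

-- Refresh the cache periodically; REPLACE overwrites updated counts.
CREATE EVENT IF NOT EXISTS refresh_top_retweets
ON SCHEDULE EVERY 1 HOUR
DO
  REPLACE INTO top_retweets_cache
  SELECT originalTweetId, COUNT(*) AS total,
         MIN(tweetContent), MIN(tweetDate)
    FROM twitter_gokhan2.tweetentities
   WHERE originalTweetId <> -1
     AND isRetweet = true
     AND tweetDate < CURDATE() - INTERVAL 4 DAY
   GROUP BY originalTweetId
  HAVING total > 50;
```

The application then runs `ORDER BY RAND() LIMIT 1200` against the small cache table, which is cheap. (Note this sketch never deletes rows that stop qualifying; a periodic TRUNCATE or DELETE before the refresh would handle that.)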

vogomatix
  • 4,856
  • 2
  • 23
  • 46
  • See amended idea. I presume you're preparing this data for a website, so just rerun the process in the background continuously. – vogomatix Nov 19 '13 at 14:26
  • One application is finding originalTweetId for each record and updating that field; however, if it is not possible to access the metadata because of a protected tweet account issue or deleted tweet issue, I put -1 in this field. – mgokhanbakal Nov 19 '13 at 14:31
  • Why not have an automated process that runs and updates the retweetCount so you can use it? – Michael L. Nov 19 '13 at 14:57
---------------------------------------------------------------

Just to clarify... did you really mean to query everything OLDER than 4 days?

 AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))

Or did you mean to aggregate only RECENT tweets WITHIN the last 4 days? To me, tweets that happened 2 years ago would be worthless for current events. If that's the case, you might be better off just changing to

 AND (tweetDate >= DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
DRapp
---------------------------------------------------------------

See if this isn't a bit faster than 40 minutes:

Test it first without the commented-out lines, then re-add them to compare the performance impact (ORDER BY RAND() in particular is known to be very slow).

SELECT
    originalTweetId,
    total,
--    tweetContent,  -- may slow things somewhat
    tweetDate
FROM (
  SELECT
    originalTweetId,
    COUNT(*) AS total,
--    tweetContent,  -- may slow things somewhat
    MIN(tweetDate) AS tweetDate,
    MAX(isRetweet) AS isRetweet
  FROM twitter_gokhan2.tweetentities
  GROUP BY originalTweetId
) AS t
WHERE originalTweetId > 0
  AND isRetweet
  AND tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY)
  AND total > 50
-- ORDER BY RAND()  -- very likely to slow performance,
                    -- test with and without...
LIMIT 0, 1200;

PS: hopefully originalTweetId is indexed.