GROUP BY with ORDER BY extremely slow on database (seems to create temp table)

Question

I'm building a twitter app that displays posted links on twitter, but I have a problem when sorting the table by time.

tweet
+----------------------------------------+
| tweet_id | [...] | created_at          |
+----------------------------------------+  
| 123456   | [...] | 2012-06-11 11:31:28 |
| 234567   | [...] | 2012-06-11 11:32:55 |
| 345678   | [...] | 2012-06-11 11:33:22 |
+----------------------------------------+

tweets_url
+---------------------+
| tweet_id | url      |
+---------------------+
| 123456   | cnn.com  |
| 123456   | fox.com  |
| 234567   | abc.com  |
| 345678   | abc.com  |
+---------------------+

Heres my SQL (I'm using GROUP by to return only unique URLS):

SELECT tweet_urls.url,
    FROM  `tweets` 
    LEFT JOIN tweet_urls ON tweet_urls.tweet_id = tweets.tweet_id 
    WHERE tweet_urls.url LIKE '%cnn.com%'
    GROUP BY tweet_urls.url 
    ORDER BY tweets.created_at DESC LIMIT 0 , 20

I tried different variations of running this query with outer select from here, using different joins and inner SELECTS.

Edit: I've done some further testing. It seems that Mysql creates a temporary table based on the GROUP BY tweet_urls.url and then orders the results w/o using the specified index because it is run on a temp table.

Here's the EXPLAIN output:

+----+-------------+------------+--------+---------------+---------+---------+-----+----------------------+----------------------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref |                rows  | Extra                                        |
+---------------------------------------------------------------------------------------------------------+----------------------------------------------+
| 1  | SIMPLE      | tweet_urls | index  | tweet_id      | url     | 422     | NULL                 86783 | Using where; Using temporary; Using filesort 
| 1  | SIMPLE      | tweets     | eq_ref | PRIMARY       | PRIMARY | 8       | tweet_urls.tweet_id        |
+----+-------------+------------+--------+---------------+---------+---------+-----+----------------------+----------------------------------------------+

Can you avoid starting your LIKE expression with a wildcard? Otherwise a table scan will be necessary. — Martin Wilson, Jun 18 '12 at 18:38
I will put index on both tweet_id and a fulltext index to tweet_urls.url and if you don't need the order by, reove it — jcho360, Jun 18 '12 at 18:40
Are you aware that `LIKE '%cnn.com%'` will also match `abcnn.comp.co.uk`? Is that really what you want? — Mark Byers, Jun 18 '12 at 18:55
Are you certain that it's the `ORDER BY` that is the largest contributor to the slow performance? How does the performance compare to a query with the `ORDER BY` clause omitted? (For testing, you want to ensure that the query cache isn't throwing off your results: `SET SESSION query_cache_type = OFF;` — spencer7593, Jun 18 '12 at 20:02
Yep, I'm sure. Query time is 0.0021s vs 0.4639. Tests are run on a development server with a small set of data, on the live server the query times are like 0.01s vs 2.5s. — Joey, Jun 18 '12 at 20:56

Mark Byers · Answer 1 · 2012-06-18T18:43:55.330

7

I think the real problem is here:

WHERE tweet_urls.url LIKE '%cnn.com%'

This type of query (LIKE without a constant prefix) cannot efficiently use an index.

You can solve this problem by adding an extra column to your table, called domain and indexing that. You can then change your query to:

WHERE tweet_urls.domain = 'cnn.com'

edited Jun 18 '12 at 18:43

answered Jun 18 '12 at 18:38

Mark Byers

811,555
193
1,581
1,452

what different will made add another column and not just indexing the url column? – jcho360 Jun 18 '12 at 18:42
Thanks for you reply, but that did not solve the long query time althoug it gives a small performance boost. – Joey Jun 18 '12 at 20:46
@Joey: Strange. Can you post the EXPLAIN? – Mark Byers Jun 18 '12 at 22:40
@MarkByers Hope format gets not cracked: id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE tweet_urls index tweet_id url 422 NULL 86783 Using where; Using temporary; Using filesort 1 SIMPLE tweets eq_ref PRIMARY PRIMARY 8 tweet_urls.tweet_id 1 Using where edit: got cracked. Sorry I'm new here. Formatting on replies seems to be a hassle ;) – Joey Jun 19 '12 at 00:37
@MarkByers made some changes to the startpost to display the EXPLAIN correctly. – Joey Jun 19 '12 at 18:15
@joey: i noticed that the index it is using is the wrong one. It should be using the index on domain. My guess is that your query is wrong. Can you post that too? – Mark Byers Jun 19 '12 at 18:17
@MarkByers I changed the url rows to only contain the domain name like you suggested. I basically trimmed everything before and after the domain to ensure that an index can be used. Query is: SELECT tweet_urls.url, FROM `tweets` LEFT JOIN tweet_urls ON tweet_urls.tweet_id = tweets.tweet_id WHERE tweet_urls.url = 'cnn.com' GROUP BY tweet_urls.url ORDER BY tweets.created_at DESC LIMIT 0 , 20 – Joey Jun 19 '12 at 18:26
@Joey: I've looked more closely at your query and it seems that it isn't well defined. You have GROUP BY tweet_urls.url but then you use ORDER BY tweets.created_at. If you use a column in the ORDER BY unless it should either be contained in an aggregate function or be functionally dependent on the columns in the GROUP BY, which I don't think is the case here. I suggest that you first try to make your query correct before trying to optimize it. – Mark Byers Jun 19 '12 at 21:10

score 3 · Answer 2 · answered Jun 18 '12 at 18:34

3

Put an index on the tweets.created_at column

answered Jun 18 '12 at 18:34

John Conde

217,595
99
455
496

sorry, forgot to mention: is indexed. – Joey Jun 18 '12 at 18:36
Is the tweet_url.url indexed? You're doing a wildcard search on a varchar column, which can be quite taxing if there are a lot of matches. Are you only storing the domain the tweet came from? It might be worthwhile making a key into that, as well, if you're seeing a lot of repetitions. – Mike Jun 18 '12 at 18:38
1

@Mike: It could be *even more* taxing if there *aren't* a lot of matches. If you have less than 20 matches then every row in the table must be checked. – Mark Byers Jun 18 '12 at 18:39
As soon as I wrote it I realized I didn't mean matches, i meant a lot of records. I upvoted your response below, as we posted at the same time and had the same idea. Thanks for correcting this mistake- i'll eave it as is to show the error of my ways. – Mike Jun 18 '12 at 18:40
1

Thanks for you reply. Even if I replace the GROUP BY completely it has no effect on the performance of the query. In fact it gets even slower because there are more results to ORDER BY ;) – Joey Jun 18 '12 at 20:50

spencer7593 · Answer 3 · 2012-06-18T22:48:07.283

Before you tune the statement, please be sure that the statement is guaranteed to return a correct result set, the result set you are expecting. (see below)

As for performance, the LIKE '%foo' predicate (with the leading wildcard) is not sargable. (That is, the query engine can't make use of an index to limit the number of rows to be checked. The query engine will need to check EVERY row in the table.

I suspect that this, along with the JOIN operation, could be a major contributor to the slow performance. (I don't see that an OUTER join is required in your query, it appears to be equivalent to an INNER join, given the predicate on tweet_urls.url.

Ideally, you would not need the % wildcard character at the beginning, and you could instead check url LIKE 'cnn.com%', without a leading wildcard, which has a chance of getting the query engine to use an index (on the url column).

Obviously, some changes to the table and index definitions could help performance, but only if you have some liberty to make those changes. (Often, the posters asking questions like yours are limited in what changes they are able to make.)

So, I am addressing only the query you have, and not suggesting any schema changes. (IF I were Faced with a requirement like yours, I would be considering making changes to the tables and indexes, possibly even a FULLTEXT index.)

But you asked about changing the query, so I'm only going to address the query.

It looks as if you mean to return distinct urls, ordered by the created_at of the LATEST tweet, and not just by the created_at of one (not necessarily the latest) tweet.

If so, I don't believe your query is returning the rows in the order you expect, given that you are referencing a non-aggregate in the ORDER BY clause which is not included in the GROUP BY clause.

NOTE: Other relational databases will throw an exception with a statement like this, e.g. Oracle will throw ORA-00979: not a GROUP BY expression and SQL Server will throw Msg 8127 Column "tweets.created_at" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause. MySQL is more liberal, which is not always a good thing.

If your query is returning rows in the order you expect, that is by happy accident, and is not due to some guaranteed behavior.

To get the rows ordered by the created_at of the LATEST tweet, you'd need a query of a different form, one that specifies that you want to sort based on the maximum created_at. As examples:

SELECT tweet_urls.url
  FROM tweet_urls
 WHERE tweet_urls.url LIKE '%cnn.com%'
 GROUP BY tweet_urls.url
 ORDER BY MAX((SELECT MAX(tweets.created_at) FROM tweets WHERE tweets.tweet_id = tweet_urls.tweet_id)) DESC LIMIT 0, 20

-- or

SELECT t.url
  FROM ( SELECT tweet_urls.url, MAX(tweets.created_at) AS max_created_at
           FROM tweets 
           JOIN tweet_urls ON tweet_urls.tweet_id = tweets.tweet_id 
          WHERE tweet_urls.url LIKE '%cnn.com%'
          GROUP BY tweet_urls.url
       ) t
 ORDER BY t.max_created_at DESC LIMIT 0, 20

Setup the test case you showed:

CREATE TABLE tweets (tweet_id INT UNSIGNED NOT NULL, created_at DATETIME) ENGINE=MyISAM; 
CREATE TABLE tweet_urls (tweet_id INT UNSIGNED NOT NULL, url VARCHAR(20) NOT NULL) ENGINE=MyISAM;
INSERT INTO tweets VALUES (123456, '2012-06-11 11:31:28'),(234567,'2012-06-11 11:32:55'),(345678,'2012-06-11 11:33:22');
INSERT INTO tweet_urls VALUES (123456,'cnn.com'),(123456,'fox.com'),(234567,'abc.com'),(345678,'abc.com');

And add a couple more rows:

INSERT INTO tweets VALUES (1, '2012-06-10'),(2,'2012-06-12'); 
INSERT INTO tweet_urls VALUES (1,'Xcnn.com'),(2,'Xcnn.com');

When I run your query, it returns rows in order by SOME created_at, but not necessarily by the created_at of the LATEST tweet.

Thanks for this great answer, it helped my a lot although your queries run 5 times slower than mine on the development server with a reduced set of data. Selecting the latest tweet is the next problem im facing and I think the whole problem lies on the GROUP BY clause. GROUP BY is returning unsorted data, which has to be sorted by the ORDER BY. I'm relatively new to developing with databases, so I don't know how to avoid this. — Joey, Jun 18 '12 at 23:26

GROUP BY with ORDER BY extremely slow on database (seems to create temp table)

3 Answers3