6

My table 'my_logs' have about 20,000,000 records, and I want to find out how many logs I have in each date within a few days.

I want to have a result like

+------------+---------+ | date | count | +------------+---------+ | 2016-07-01 | 1623 | | 2016-07-02 | 1280 | | 2016-07-03 | 2032 | +------------+---------+

This query below only take me milliseconds to finish, that's good

SELECT  DATE_FORMAT(created_at, '%Y-%m-%d') as date,
        COUNT(*) as count
    FROM  my_logs
    WHERE  created_at BETWEEN '2016-07-01' AND '2016-07-04'
    GROUP BY  DATE_FORMAT(created_at, '%Y-%m-%d')

The Explain of query:

+------------+---------+-------+-----------------------------+ |select_type | table | type | possible_keys | +------------+---------+-------+-----------------------------+ | SIMPLE | my_logs| index | index_my_logs_on_created_at | +------------+---------+-------+-----------------------------+

+-----------------------------+---------+----------+ | key | key_len | rows | +-----------------------------+---------+----------+ | index_my_logs_on_created_at | 10 | 23458462 | +-----------------------------+---------+----------+

+-----------------------------------------------------------+ | Extra | +-----------------------------------------------------------+ | Using where; Using index; Using temporary; Using filesort | +-----------------------------------------------------------+

However, I need to convert the timezone of each record to fit the time in my country, and I need to group by the 'Date' information, so I need to convert the column itself.

Both

SELECT  COUNT(*)
    FROM  my_logs
    WHERE  DATE_ADD(created_at, INTERVAL 8 HOUR) BETWEEN '2016-07-01' AND '2016-07-04'
    GROUP BY  DATE_FORMAT(DATE_ADD(created_at, INTERVAL 8 HOUR), '%Y-%m-%d')

and

SELECT  COUNT(*)
    FROM  my_logs
    WHERE  CONVERT_TZ(created_at, "+00:00", "+08:00") BETWEEN '2016-07-01' AND '2016-07-04'
    GROUP BY  DATE_FORMAT(CONVERT_TZ(created_at, "+00:00", "+08:00"),

'%Y-%m-%d')

take me about 12s to finish the query, it is unbearable slow!!

(The Explain is the same as the query in the top)


I think it is common problem but I can't find a good way to deal with it, does anyone has a more efficient way to do it? Thanks!

Rick James
  • 135,179
  • 13
  • 127
  • 222
林鼎棋
  • 1,995
  • 2
  • 16
  • 25
  • 1
    [`convert_tz`](http://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html#function_convert-tz) converts from one timezone to another. – Jonathon Ogden Aug 25 '16 at 07:40
  • Can you provide the query plans (`EXPLAIN` output) for both cases? – Vatev Aug 25 '16 at 07:59
  • I add the output! Thanks. Actually, I already got the output I want, but my problem is about the efficiency. After considering the timezone issue, I can't find an efficiency way to do something which I think should be easy originally. – 林鼎棋 Aug 25 '16 at 08:14
  • I meant the output of [EXPLAIN ](https://dev.mysql.com/doc/refman/5.7/en/explain.html), which shows what the database is actually going to do when the query runs. It helps a lot when trying to determine how to make it faster. – Vatev Aug 25 '16 at 08:19
  • OK, I add it, too. The Explain is all the same. – 林鼎棋 Aug 25 '16 at 08:53
  • Only miliseconds for a 20M rows table scan doesn't sound right. Can you try adding [SQL_NO_CACHE](http://dev.mysql.com/doc/refman/5.7/en/query-cache-in-select.html) to the first query and try it again? – Vatev Aug 25 '16 at 11:07
  • The exact time comsume is 190~220 ms, after add SQL_NO_CACHE, it comes to 193 ms. – 林鼎棋 Aug 26 '16 at 08:04
  • `SQL_NO_CACHE` bypasses the "Query cache", thereby providing an 'honest' timing for the query. (Without that, you might get a bogus 1ms for a complex query on a big table.) 190~220ms is too big for the QC to be involved. – Rick James Aug 26 '16 at 19:00

1 Answers1

5

Which datatype, TIMESTAMP vs. DATETIME, did you use? (But, I'll ignore that.)

Do not "hide" an indexed column (created_at) inside any function (CONVERT_TZ()). It makes it so that the WHERE clause cannot use the index and must scan the table instead. This fix is simple:

WHERE created_at >= '2016-07-01' - INTERVAL 8 HOUR
  AND created_at  < '2016-07-04' - INTERVAL 8 HOUR

(or use CONVERT_TZ). Note that I also fixed the bug wherein you included midnight from the 4th. Note: Even + INTERVAL... is effectively a function.

Expressions in the SELECT and the GROUP BY are far less critical to performance.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Great! So the best way to do it is converting the date range in WHERE clause and converting the column in GROUP clause! That solve my problem! Thanks – 林鼎棋 Aug 26 '16 at 08:14