Optimizing COUNT(DISTINCT) slowness even with covering indexes

Question

We have a table in MySql with arround 30 million records, the following is table structure

CREATE TABLE `campaign_logs` (
  `domain` varchar(50) DEFAULT NULL,
  `campaign_id` varchar(50) DEFAULT NULL,
  `subscriber_id` varchar(50) DEFAULT NULL,
  `message` varchar(21000) DEFAULT NULL,
  `log_time` datetime DEFAULT NULL,
  `log_type` varchar(50) DEFAULT NULL,
  `level` varchar(50) DEFAULT NULL,
  `campaign_name` varchar(500) DEFAULT NULL,
  KEY `subscriber_id_index` (`subscriber_id`),
  KEY `log_type_index` (`log_type`),
  KEY `log_time_index` (`log_time`),
  KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
  KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

Following is my query

I'm doing UNION ALL instead of using IN operation

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       count(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
            COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_SENT'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
            COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_CLICKED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date,

Following is my Explain statement

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                    |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                          |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+

I changed COUNT(subscriber_id) to COUNT(*) and observed no performance gain.

2.I removed COUNT(DISTINCT subscriber_id) from the query , then I got huge performance gain , I'm getting results in approx 1.5 sec, previously it was taking 50 sec - 1 minute. But I need distinct count of subscriber_id from the query

Following is explain when I remove COUNT(DISTINCT subscriber_id) from the query

+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                                     |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using temporary; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                                           |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+

I ran three queries individually by removing UNION ALL. ONe query took 32seconds , others are taking 1.5 seconds each, but first query is dealing with around 350K records and others are dealing with only 2k rows

I could solve my performance problem by leaving out COUNT(DISTINCT...) but I need those values. Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?

UPDATE the following information is about data distribution of above table

for 1 domain 1 campaign 20 log_types 1k-200k subscribers

The above query I'm running for , the domain having 180k+ subscribers.

Why not `AND (log_type = 'EMAIL_OPENED' OR log_type = 'EMAIL_SENT' OR log_type = 'EMAIL_CLICKED')` — Tschallacka, Mar 16 '15 at 12:27
remove all index and just create a group index for (domain,campaign_id,log_type,log_time) — Ashok sri, Mar 16 '15 at 12:27
Try adding `ORDER BY NULL` after each `GROUP BY` this might get rid of the filesort. — GarethD, Mar 16 '15 at 12:39
Your `EXPLAIN` makes it clear that your compound index is being used as you intend. Some things to try: 1) change `COUNT(subscriber_id)` to `COUNT(*)` and see if performance improves. 2) try getting rid of `COUNT(DISTINCT subscriber_id)` and see if performance improves. Run each of the three subqueries you're combining with `UNION ALL` and see if one of them has poorer performance than the other two. Please update your question with results of those tests. — O. Jones, Mar 16 '15 at 13:28
Try `GROUP BY HOUR( CONVERT_TZ(log_time,'+00:00','+05:30'))` in place of your log_date result set column formatted `%l %p`. Numeric grouping may make it easier to process. — O. Jones, Mar 16 '15 at 13:41
@GarethD I tried like as you mentioned ORDER BY NULL, unfortunately I didn't get any better performance — Rams, Mar 16 '15 at 18:07
Here is just my understanding of what is going on inside the engine. It may spark some ideas. Your index helps to quickly find those 350K rows among 30M. Then engine has to read all these 350K rows to group and count them. When there is no `DISTINCT`: to `GROUP` them engine sorts 350K rows by the result of the `DATE_FORMAT` function and then steps through the result of the sort and counts rows in any order they appear. When you add `DISTINCT` the engine has to sort again within each group. Kind of nested sort. Apparently, this is not handled efficiently. — Vladimir Baranov, Mar 19 '15 at 05:32
You really want the number of different subscribers during the 3am hour for the month? And a different number for the 4am hour? This seems like a strange way to slice the data. — Rick James, Mar 20 '15 at 04:51
Do you realize you are 86399 seconds more than exactly one month? Sounds like a bug. — Rick James, Mar 20 '15 at 04:52
Will you accept the value of `COUNT(DISTINCT subscriber_id)` being off by about 1%? If so, (and you fix the `BETWEEN`), I can explain how to revise the query (and other things) to speed it up by at least 100-fold. — Rick James, Mar 20 '15 at 04:54
How long does the query take if you rewrite it using `IN` (so no `union`s) and remove the `count(distinct)`? — Gordon Linoff, Mar 21 '15 at 13:13
@GordonLinoff If I remove count(distinct) then query is so fast. there is no much difference between between IN and UNION. — Rams, Mar 23 '15 at 14:01

score 5 · Accepted Answer · edited Feb 21 '20 at 23:56

If the query without the count(distinct) is going much faster, perhaps you can do nested aggregation:

SELECT log_type, log_date,
       count(*) AS COUNT, sum(cnt) AS total
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id, count(*) as cnt
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND 
                             CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
      GROUP BY log_type, log_date, subscriber_id
     ) l
GROUP BY logtype, log_date;

With a bit of luck, this will take 2-3 seconds rather than 50. However, you might need to break this out into subqueries, to get full performance. So, if this does not have a significant performance gain, change the in back to = one of the types. If that works, then the union all may be necessary.

EDIT:

Another attempt is to use variables to enumerate the values before the group by:

SELECT log_type, log_date, count(*) as cnt,
       SUM(rn = 1) as sub_cnt
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id,
             (@rn := if(@clt = concat_ws(':', campaign_id, log_type, log_time), @rn + 1,
                        if(@clt := concat_ws(':', campaign_id, log_type, log_time), 1, 1)
                       )
              ) as rn
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index) CROSS JOIN
           (SELECT @rn := 0)
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00', '+00:00', '+05:30') AND 
                             CONVERT_TZ('2015-03-01 23:59:58', '+00:00', '+05:30')
      ORDER BY log_type, log_date, subscriber_id
     ) t
GROUP BY log_type, log_date;

This still requires another sort of the data, but it might help.

I think you won't benefit of the index in the inner query if you group by log_date, which is a calculated field. MySQL won't be able to use the defined index to sort and filter `subscriber_id`. So grouping by subscriber_id would be equivalent as counting distinct subscribers in term of performances. — Adam, Mar 21 '15 at 13:09
@Adam . . . (1) It will use the index, just not as completely as in the OPs queries with the `union`. I don't know the distribution of the data, so this might be sufficiently selective. In the last paragraph, I tried to suggest that the version with `union all` might be needed. (2) Although the outputs are the same, the underlying methods are not, and the `count(distinct)` might be slower than two aggregations. — Gordon Linoff, Mar 21 '15 at 13:11
Of course, I wasn't thinking that your query would be less efficient. It will use the index (`campaign_id,domain,log_type,log_time`) to match rows with the where condition and grouping by log_type. But I think that the slower part of the query is still the fact that the (calculated) log_date isn't part of the index and so counting/grouping subscribers will be slow as it won't take profit of the index. Things would have been different if log_date was part of the table structure. — Adam, Mar 21 '15 at 13:34
@GordonLinoff I tried query like you mentioned, but I didn't get any increase in performance. I mentioned about data distribution of table in my question. please check. — Rams, Mar 23 '15 at 05:47
@rams . . . Is the performance much faster if you have just `log_type = 'EMAIL_SENT'` instead of the `IN` ? I'm trying to figure out how important using the index is for all the conditions. — Gordon Linoff, Mar 24 '15 at 01:31
@GordonLinoff no even if I use log_type = 'EMAIL_SENT' , it's taking long time because there are more records that are matching with log_type = 'EMAIL_SENT' and with distinct subscriber_id. — Rams, Mar 24 '15 at 05:56

Vladimir Baranov · Answer 2 · 2015-03-25T22:29:21.640

To answer your question:

Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?

Yes, do not group by the calculated field (do not group by the result of the function). Instead, pre-calculate it, save it to the persistent column and include this persistent column into the index.

I would try to do the following and see if it changes performance significantly.

1) Simplify the query and focus on one part. Leave only one longest running SELECT out of the three, get rid of UNION for the tuning period. Once the longest SELECT is optimized, add more and check how the full query works.

2) Grouping by the result of the function doesn't let the engine use index efficiently. Add another column to the table (at first temporarily, just to check the idea) with the result of this function. As far as I can see you want to group by 1 hour, so add column log_time_hour datetime and set it to log_time rounded/truncated to the nearest hour (preserve the date component).

Add index using new column: (domain, campaign_id, log_type, log_time_hour, subscriber_id). The order of first three columns in the index should not matter (because you use equality compare to some constant in the query, not the range), but make them in the same order as in the query. Or, better, make them in the index definition and in the query in the order of selectivity. If you have 100,000 campaigns, 1000 domains and 3 log types, then put them in this order: campaign_id, domain, log_type. It should not matter much, but is worth checking. log_time_hour has to come fourth in the index definition and subscriber_id last.

In the query use new column in WHERE and in GROUP BY. Make sure that you include all needed columns in the GROUP BY: both log_type and log_time_hour.

Do you need both COUNT and COUNT(DISTINCT)? Leave only COUNT first and measure the performance. Leave only COUNT(DISTINCT)and measure the performance. Leave both and measure the performance. See how they compare.

SELECT log_type,
       log_time_hour,
       count(DISTINCT subscriber_id) AS distinct_total,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time_hour >= '2015-02-01 00:00:00' 
  AND log_time_hour <  '2015-03-02 00:00:00'
GROUP BY log_type, log_time_hour

score 1 · Answer 3 · answered Mar 16 '15 at 12:28

1

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       count(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_type, log_date

Add AND log_type IN ('EMAIL_OPENED', 'EMAIL_SENT', 'EMAIL_CLICKED') if needed.

answered Mar 16 '15 at 12:28

jarlh

42,561
8
45
63

I tried in this way, I didn't get any better performance.Thanks – Rams Mar 16 '15 at 12:32
Why does this help? This answer is useless with no additional explanation. You haven't even pointed out what you have changed. – GarethD Mar 16 '15 at 12:32
1

Ok, didn't improve performance... I'd expect somewhat faster execution since the table is read fewer times. Anyway, it will make code easier to understand and maintain... – jarlh Mar 16 '15 at 12:34

juacala · Answer 4 · 2015-03-19T05:07:33.230

I would try other orderings of the index you're using, moving the subscriber_id around, and see what the effect is. It's possible you can get better results by moving columns up with a higher cardinality.

At first, I thought it might only be using part of the index (not getting to subscriber_id at all). If it can't use subscriber_id, then moving it up the index tree will cause it to run slower, which will at least tell you it can't use it.

I can't think of much else you can play with.

Adam · Answer 5 · 2015-03-20T21:36:37.540

subscriber_id is of no use in your key because you are grouping by a calculated field outside the key (log_date) before counting distinct subscribers. It explains why this is so slow, because MySQL has to sort and filter duplicate subscribers without use of the key.
There might be an error with your log_time condition : you should have the opposite timezone conversion of your select (i.e. '+05:30','+00:00'), but it wont have any major incidence on your query time.
You can avoid the "union all" by doing a log_type IN (...) and group by log_type, log_date

Best effective solutions would be to add a mid-hour field to your database schema and set there one of the 48 mid-hour of the day (and take care of mid-hour timezone). So you could be able to use an index on campaign_id,domain,log_type,log_mid_hour,subscriber_id

This will be quite redundant but will improve the speed.

So this should led to some initializations in your table: be careful : do not test this on your production table

ALTER TABLE campaign_logs
   ADD COLUMN log_mid_hour TINYINT AFTER log_time;

UPDATE campaign_logs SET log_mid_hour=2*HOUR(log_time)+IF(MINUTE(log_time)>29,1,0);

ALTER TABLE campaign_logs
ADD INDEX(`campaign_id`,`domain`,`log_time`,`log_type`,`log_mid_hour`,`subscriber_id`);

You'll also have to set the log_mid_hour in your script for future records.

Your query will become (for an 11 mid-hour timeshift):

SELECT log_type,
   MOD(log_mid_hour+11, 48) tz_log_mid_hour,
   COUNT(DISTINCT subscriber_id) AS COUNT,
   COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
   AND campaign_id='123'
   AND log_type IN('EMAIL_SENT', 'EMAIL_OPENED','EMAIL_CLICKED')
   AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+05:30','+00:00')   
   AND CONVERT_TZ('2015-03-01 23:59:58','+05:30','+00:00')
GROUP BY log_type, log_mid_hour;

This will give you the count for each mid-hour taking full benefit of your index.

score 0 · Answer 6 · edited May 23 '17 at 12:14

I had a very similar problem, posted here on SO, and got some great help. Here's the thread: MySQL MyISAM slow count() query despite covering index

In a nutshell, I discovered that my problem had NOTHING to do with the query or the indexes, and EVERYTHING to do with the way I had set up the tables and MySQL. My exact same query became much faster when I:

Switched to InnoDB (which you're already using)
Switched the CHARSET to ASCII. If you don't need utf8, it takes 3x as much space (and time to search).
Make each column size as small as possible, not null if possible.
Increased MySQL's InnoDB buffer pool size. Many recommendations are to increase it to 70% of your RAM if this is a dedicated machine.
I Sorted my table by my covering index, wrote it out via SELECT INTO OUTFILE, and then re-inserted it into a new table. This physically sorts all the records in the search order.

I have no idea which of these changes fixed my problem (because I was unscientific and didn't try them one at a time), but it made my queries 50-100x faster. YMMV.

Optimizing COUNT(DISTINCT) slowness even with covering indexes

6 Answers6