0

I have a MySQL table of 10 million rows and 3 columns, in following format:

id                                     time                               num

ca65e871-d758-437e-b76f-175234760e7b  2020-11-14 23:08:05.553770          11112222222
...

For running the first query below, I indexed the table on (num, time) and it works very fast (<5 milliseconds on 10 million rows table):

SELECT COUNT(*) 
FROM TABLE_NAME 
WHERE time >= '2020-11-14 23:08:05.553752' AND num = 11112222222

However I also need to execute count(distinct) on the same table with between clause, something like this:

SELECT COUNT(DISTINCT num) 
FROM TABLE_NAME 
WHERE time >= '2020-11-14 23:08:05.553752'
  AND num BETWEEN (11112222222 - 30)
              AND (11112222222 + 30)

This turns out to be significantly slower, around 200 milliseconds. Is there a way to speed the execution time of the second query on the same table?

Rick James
  • 135,179
  • 13
  • 127
  • 222
Makaroni
  • 880
  • 3
  • 15
  • 34
  • Does this answer your question? [postgresql COUNT(DISTINCT ...) very slow](https://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow) – Amira Bedhiafi Nov 24 '20 at 18:09
  • @SandraGuilepZouaouiZandeh Not quite. It speeds up the query a little bit, to around 160 milliseconds on average, but it is still slow for my purposes. – Makaroni Nov 24 '20 at 18:16
  • @Makaroni https://www.sisense.com/blog/use-subqueries-to-count-distinct-50x-faster/ – Amira Bedhiafi Nov 24 '20 at 18:19
  • 200 milliseconds is still not what I would call slow, but that's me. A certain amount of time of every execution is taken up by the SQL engine computing its "query plan", that is analyzing the query and the available indices to come up with the best strategy to satisfy the request. Perhaps the longer time is due primarily to more time being taken in coming up with the query plan and if you had double the number of rows the the execution execution time would only slightly increase. This, of course, is all conjecture. Could it be that for the first SQL only processing the index was necessary? – Booboo Nov 24 '20 at 18:22
  • 2
    You should look at the query plans by prefixing the SQL with `EXPLAIN `. Perhaps the timing differences will become clearer. – Booboo Nov 24 '20 at 18:27
  • How many rows need to be touched in the query? That is a significant factor in speed. – Rick James Nov 28 '20 at 02:38

2 Answers2

1

If your MySQl is 8+ then try:

WITH RECURSIVE
cte AS ( SELECT 11112222222 - 30 num
         UNION ALL
         SELECT num + 1 FROM cte WHERE num < 11112222222 + 30 )
SELECT COUNT(*)
FROM cte
WHERE EXISTS ( SELECT NULL
               FROM TABLE_NAME 
               WHERE TABLE_NAME.num = cte.num
                 AND time >= '2020-11-14 23:08:05.553752' )

If you'll often execute such query then I'd suggest to create service table with the numbers from -30 to 30 and use it instead of recursive CTE.

Akina
  • 39,301
  • 5
  • 14
  • 25
  • Absolutely amazing! This makes the query runs for less than 5 milliseconds. Thank you very much! Can you give me a brief explanation of what you did? – Makaroni Nov 24 '20 at 18:29
  • 1
    @Makaroni I generate the `num`s list within the range then simply test does there exists a row with this `num` within the date range. WHERE EXISTS checks the presence - i.e. it executes for each row but for each `num` it aborts after it find one value (whereas your query needs to find all matched rows). – Akina Nov 24 '20 at 18:32
  • Fantastic! I will read this explanation several times in order to really figure it out. :)) Thanks again! – Makaroni Nov 24 '20 at 18:35
  • BTW, can you help me with similar problem of using `count(distinct)`? Can I apply `recursive cte` in this case also? https://stackoverflow.com/questions/65005378/speeding-up-countdistinct-of-one-column-based-on-condition-on-other-column-in – Makaroni Nov 25 '20 at 13:38
1

This is a 2D problem. And your WHERE clause is like a "bounding box".

Add INDEX(time, num) to give the Optimizer another choice.

If you expect to have more complex queries based on 2 ranges, see http://mysql.rjweb.org/doc.php/find_nearest_in_mysql

Rick James
  • 135,179
  • 13
  • 127
  • 222