
Query Optimization

I need help optimizing the performance of this query. It computes the cumulative sum for all the periods that match the list of CASE WHEN conditions.

Currently, this query takes about 100 seconds to run, since it groups by every account in the database. I tried to optimize it by looking at the EXPLAIN output, but I couldn't figure out a way to make it work. Here is the EXPLAIN output:

[EXPLAIN output attached as an image]

The ideal time would be 10 seconds or less. Looking forward to your response. Thank you!

SET @date = '2017-05-17';
SET @offset = 1;

select
    b.act,
    CASE
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 5 DAY)   AND b.jdt >= DATE_SUB(@date, INTERVAL 5 + @offset DAY)   AND DATEDIFF(a.dt, b.jdt) <= 5   THEN 5
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 13 DAY)  AND b.jdt >= DATE_SUB(@date, INTERVAL 13 + @offset DAY)  AND DATEDIFF(a.dt, b.jdt) <= 13  THEN 13
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 25 DAY)  AND b.jdt >= DATE_SUB(@date, INTERVAL 25 + @offset DAY)  AND DATEDIFF(a.dt, b.jdt) <= 25  THEN 25
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 45 DAY)  AND b.jdt >= DATE_SUB(@date, INTERVAL 45 + @offset DAY)  AND DATEDIFF(a.dt, b.jdt) <= 45  THEN 45
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 75 DAY)  AND b.jdt >= DATE_SUB(@date, INTERVAL 75 + @offset DAY)  AND DATEDIFF(a.dt, b.jdt) <= 75  THEN 75
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 105 DAY) AND b.jdt >= DATE_SUB(@date, INTERVAL 105 + @offset DAY) AND DATEDIFF(a.dt, b.jdt) <= 105 THEN 105
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 135 DAY) AND b.jdt >= DATE_SUB(@date, INTERVAL 135 + @offset DAY) AND DATEDIFF(a.dt, b.jdt) <= 135 THEN 135
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 165 DAY) AND b.jdt >= DATE_SUB(@date, INTERVAL 165 + @offset DAY) AND DATEDIFF(a.dt, b.jdt) <= 165 THEN 165
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 195 DAY) AND b.jdt >= DATE_SUB(@date, INTERVAL 195 + @offset DAY) AND DATEDIFF(a.dt, b.jdt) <= 195 THEN 195
        WHEN b.jdt <= DATE_SUB(@date, INTERVAL 225 DAY) AND b.jdt >= DATE_SUB(@date, INTERVAL 225 + @offset DAY) AND DATEDIFF(a.dt, b.jdt) <= 225 THEN 225
        ELSE 'other'
    END AS 'period',

    SUM(CASE WHEN a.type = 'JN' AND a.paid = 'Y' AND a.upgraded = 0 THEN 1 ELSE 0 END) AS 'Paid_Joins',
    SUM(CASE WHEN a.type IN ('SL','RL') AND ttype != 'Purchase' THEN (a.amt_usd/100 - a.vat_usd/100) END) AS 'Revenue_Amount'

FROM __customer b
JOIN __transaction a ON b.uid = a.primary_uid

WHERE b.affiliate_act REGEXP '^[a-zA-Z]+[0-9]+'
  AND a.dt <= @date
  AND a.dt >= DATE_SUB(@date, INTERVAL 225 + @offset DAY)
  AND b.jdt >= DATE_SUB(@date, INTERVAL 225 + @offset DAY)
GROUP BY 1, 2
HAVING period != 'other';

UPDATE

Table structure:

[Table structure attached as images]

UPDATE 2

I ran the same query logic on the transaction table without the customer-table join, and it still scans the same number of rows as the join does. Since it looks through every row in the database, I can't think of any more effective WHERE clauses to add to limit the number of rows scanned.

SET @date = '2017-05-17';
SET @offset = 2;
SET @start = DATE_SUB(@date, INTERVAL 225 + @offset DAY);

explain
select 
    a.account,        
    SUM(CASE WHEN a.type = 'JN' AND a.paid = 'Y' AND a.upgraded=0 
             THEN 1 
             ELSE 0 
        END) AS 'Paid_Joins'       
FROM __transaction a        
WHERE a.account regexp '^[a-zA-Z]+[0-9]+'
  AND a.dt <= @date 
  AND a.dt >= @start
-- AND b.affiliate_act = 'el4557'
GROUP BY 1

[EXPLAIN output attached as an image]

The number of rows scanned here is the same as when the join is present.

[EXPLAIN output attached as an image]

smzsmzsmz
  • 2
    Performance questions should include `EXPLAIN ANALYZE` and some information about table size, indexes, current performance, desired time, etc. `Slow` is a relative term and we need a real value to compare. [**MySQL**](http://dba.stackexchange.com/questions/15371/how-do-i-get-the-execution-plan-for-a-view) [**How do I obtain a Query Execution Plan?**](http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan) – Juan Carlos Oropeza Aug 09 '17 at 19:57
  • I don't think there is EXPLAIN ANALYZE in MySQL though, but I tried to attach something similar that might help. – smzsmzsmz Aug 09 '17 at 22:18
  • We need your create table and your indexes so we can understand what you already have. The links I provided tell how to get the Query Execution Plan. The picture you posted shows a `FULL TABLE SCAN`; it looks like you have no index for the date. You probably need a composite index `(uid, date)` – Juan Carlos Oropeza Aug 09 '17 at 22:45
  • Thanks for your reply. I do have indexes, though certainly not composite ones. I have attached the table structure. Is there anything I have to change about the indexes to improve performance? – smzsmzsmz Aug 09 '17 at 22:55
  • 1
    I told you to use a composite index. Also, you are trying to solve a big problem; instead, try to solve a smaller problem first. Remove all the `CASE` and `GROUP BY`, test the `WHERE` with the new index, and if things get better, start adding more parts back. – Juan Carlos Oropeza Aug 10 '17 at 14:49
  • Juan Carlos Oropeza: I have added a composite index (jdt, uid) on customer and (type, dt) on transaction. Also, I have checked the query without the join. Please see my update. – smzsmzsmz Aug 10 '17 at 18:34
  • Well, you changed the `WHERE`; now there is a `b.affiliate_act regexp '^[a-zA-Z]+[0-9]+'` that will need a full table scan. You can try a subquery to get the data between the dates and then search for the regexp, to see if it is faster – Juan Carlos Oropeza Aug 10 '17 at 18:45
  • Check the edit; you don't need multiple `SUM()` to show us the problem. Try to make the problem simple so it is easy to decipher where the problem is – Juan Carlos Oropeza Aug 10 '17 at 18:54
  • OK. Regardless of the regular expression and the date in the where clause, it is always a full table scan. I guess that's the main question. – smzsmzsmz Aug 10 '17 at 19:26
  • 1
    Show me the create index. How many days are in that table? Is there any chance all the dates are in that range? – Juan Carlos Oropeza Aug 10 '17 at 19:35
  • @JuanCarlosOropeza Updated – smzsmzsmz Aug 10 '17 at 23:08
  • cease using images, copy the data displayed in those images and paste it instead (then format as code) ... way better for you and for us. – Paul Maxwell Aug 11 '17 at 01:11
  • sigh. You keep changing the query logic, hence you keep chasing an elusive answer. I suggest you learn how to interpret the explain plan output. **Look at the "possible key" column**, then look at the comment on the same row: is it "using index"? If not, try creating the suggested (possible) key. After creating that possible key, run the explain again (making no other changes to the query) and then ask: is the new index being used? – Paul Maxwell Aug 11 '17 at 01:14
  • using regexp in your where clause is "not sargable" (unable to use indexes) so that will force a table scan. – Paul Maxwell Aug 11 '17 at 01:15
  • Or, in this case, an "index scan". – Rick James Aug 16 '17 at 20:17

2 Answers

0

Make sure a.primary_uid, a.dt, and b.uid are indexed. Try a composite index on (a.dt, a.primary_uid).
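A sketch of that DDL, assuming the table and column names from the question (the index names here are placeholders):

```sql
-- Composite index: the dt range scan also carries the join column.
CREATE INDEX ix_trans_dt_uid ON __transaction (dt, primary_uid);

-- Lookup side of the join on the customer table.
CREATE INDEX ix_cust_uid ON __customer (uid);
```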

With 5,187,819 rows processed, DATE_SUB is potentially called 103,756,380 times, depending on how the optimizer interprets the code. This is why I recommend pulling it out of the query, where it will be called only ten times.

Try pre-calculating the date intervals so they're not recomputed for each evaluation of the CASE statement. You could also pre-calculate the date interval minus the offset; I'll leave that to you if this helps.

SET @date = '2017-05-17';
SET @offset = 1;
SET @dtint5 = DATE_SUB(@date, INTERVAL 5 DAY);
SET @dtint13 = DATE_SUB(@date, INTERVAL 13 DAY);
SET @dtint25 = DATE_SUB(@date, INTERVAL 25 DAY);
SET @dtint45 = DATE_SUB(@date, INTERVAL 45 DAY);
SET @dtint75 = DATE_SUB(@date, INTERVAL 75 DAY);
SET @dtint105 = DATE_SUB(@date, INTERVAL 105 DAY);
SET @dtint135 = DATE_SUB(@date, INTERVAL 135 DAY);
SET @dtint165 = DATE_SUB(@date, INTERVAL 165 DAY);
SET @dtint195 = DATE_SUB(@date, INTERVAL 195 DAY);
SET @dtint225 = DATE_SUB(@date, INTERVAL 225 DAY);

select
    b.act,
    CASE
        WHEN b.jdt <= @dtint5   AND b.jdt >= @dtint5   - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 5   THEN 5
        WHEN b.jdt <= @dtint13  AND b.jdt >= @dtint13  - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 13  THEN 13
        WHEN b.jdt <= @dtint25  AND b.jdt >= @dtint25  - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 25  THEN 25
        WHEN b.jdt <= @dtint45  AND b.jdt >= @dtint45  - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 45  THEN 45
        WHEN b.jdt <= @dtint75  AND b.jdt >= @dtint75  - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 75  THEN 75
        WHEN b.jdt <= @dtint105 AND b.jdt >= @dtint105 - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 105 THEN 105
        WHEN b.jdt <= @dtint135 AND b.jdt >= @dtint135 - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 135 THEN 135
        WHEN b.jdt <= @dtint165 AND b.jdt >= @dtint165 - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 165 THEN 165
        WHEN b.jdt <= @dtint195 AND b.jdt >= @dtint195 - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 195 THEN 195
        WHEN b.jdt <= @dtint225 AND b.jdt >= @dtint225 - INTERVAL @offset DAY AND DATEDIFF(a.dt, b.jdt) <= 225 THEN 225
        ELSE 'other'
    END AS 'period',

    SUM(CASE WHEN a.type = 'JN' AND a.paid = 'Y' AND a.upgraded = 0 THEN 1 ELSE 0 END) AS 'Paid_Joins',
    SUM(CASE WHEN a.type IN ('SL','RL') AND ttype != 'Purchase' THEN (a.amt_usd/100 - a.vat_usd/100) END) AS 'Revenue_Amount'

FROM __customer b
JOIN __transaction a ON b.uid = a.primary_uid

WHERE a.dt <= @date

GROUP BY 1, 2;
LAS
  • Thanks. However, I tried hard-coding all the dates, but it doesn't look like the query time improved at all. – smzsmzsmz Aug 10 '17 at 00:06
  • How many rows are processed before the aggregate (a.dt < @date)? I ask because the case statement is performed once per row, then the rows are aggregated. Removing hard-coded dates won't solve that. Coding it outside the case statement as I've shown will. – LAS Aug 10 '17 at 00:50
  • It's not the hard-coding of the date, it's the function call date_sub I'm trying to remove from the select iteration. – LAS Aug 10 '17 at 02:05
  • Added more conditions to the where clause to reduce number of rows being processed. – LAS Aug 10 '17 at 11:14
  • keep using the EXPLAIN output for each query variant. try defining a **composite index primary_uid, dt** on table __transaction it is the "using temporary" & "using filesort" you want to remove from the explain output. Suggest you also try reversing the order of tables in the query: FROM __transaction t JOIN __customer c ON c.uid = t.primary_uid *(ps: I dislike aliases that imply an order like a,b,c)* – Paul Maxwell Aug 10 '17 at 12:50
  • There are about 5M rows processed, as you can see in the explain output. Since the starting date of the transactions depends dynamically on the "date" and "offset" arguments, the only way I can limit the starting date is by putting in a dynamic calculation, but that doesn't seem to help the situation here. – smzsmzsmz Aug 10 '17 at 18:19
  • Used_By_Already: I reversed the order of the joins but it's the same thing. Haha, the alias is just for the test query; the production query uses the actual table aliases. – smzsmzsmz Aug 10 '17 at 18:22
  • You need to use `@` to address another user so he gets a notification: @Used_By_Already The order of the `JOIN` doesn't affect the performance – Juan Carlos Oropeza Aug 10 '17 at 18:47
  • But does it affect the explain plan? Have you created the composite index? It is "using temporary" and "using filesort" that you need to remove from the explain output. Use explain.... it is your performance tuning friend. – Paul Maxwell Aug 10 '17 at 21:35
  • @LAS I will try to put all calculation outside to see if it helps, and I will let you know how it goes – smzsmzsmz Aug 10 '17 at 23:10
0

What it LOOKS like you are trying to do is break down account activity into bucket ranges of historical transactions for customers. However, looking at your date tests, each bucket is essentially two days, such as

5 days results in 5/11-5/12
13 days results in 5/3-5/4
25 days results in 4/21-4/22

instead of a full bucket such as:

5/11 - 5/17
5/2  - 5/10
4/21 - 5/2
??? - 4/20
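One way contiguous buckets like these could be expressed is with a single DATEDIFF per row, letting the first matching WHEN define each tier's lower edge. This is only a sketch of the CASE expression, using the question's variables and thresholds; verify the boundaries against your own period definitions:

```sql
-- Bucket each customer's join date into contiguous ranges relative to @date.
-- The first matching WHEN wins, so each tier starts where the previous one ends.
CASE
    WHEN DATEDIFF(@date, b.jdt) <= 5   THEN 5
    WHEN DATEDIFF(@date, b.jdt) <= 13  THEN 13
    WHEN DATEDIFF(@date, b.jdt) <= 25  THEN 25
    WHEN DATEDIFF(@date, b.jdt) <= 45  THEN 45
    WHEN DATEDIFF(@date, b.jdt) <= 75  THEN 75
    WHEN DATEDIFF(@date, b.jdt) <= 105 THEN 105
    WHEN DATEDIFF(@date, b.jdt) <= 135 THEN 135
    WHEN DATEDIFF(@date, b.jdt) <= 165 THEN 165
    WHEN DATEDIFF(@date, b.jdt) <= 195 THEN 195
    WHEN DATEDIFF(@date, b.jdt) <= 225 THEN 225
    ELSE 'other'
END AS period
```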

If I run a simple query with your date/interval settings:

SELECT DATE_SUB(@date, INTERVAL 5 DAY) - @offset   (result: 20170511, which looks like a number, not a date)

and

SELECT DATE_SUB(@date, INTERVAL 5 DAY)             (result: 2017-05-12, the expected date)

So your range would represent 2017-05-11 <= jdt AND jdt <= 2017-05-12, which covers only two days. I can only assume you want the adjustment to be based on your @offset value to better span time. This appears to be a very awkward type of query.

If you are looking for a single day of activity, you might actually mean to do the following:

SELECT DATE_SUB(@date, INTERVAL 5 + @offset DAY)   (result: 2017-05-11, the expected date)

and

SELECT DATE_SUB(@date, INTERVAL 5 DAY)             (result: 2017-05-12, the expected date)

For date/time range comparisons, I typically test >= the start date and < the NEXT day in question, so every hour/minute/second within the day is captured:

jdt >= '2017-05-11' AND jdt < '2017-05-12'

This gets you everything on 5/11 up through 11:59:59 PM, but NOT including the 5/12 date. If you actually meant bucket ranges as I described, I can offer a cleaner solution; just let me know.

In addition, for performance: you are looking at ALL transactions prior to a given date, so you are blowing through almost your entire transaction table. What other transaction "types" are in the table that an index could help discard? It appears you only care about 'JN', 'SL', and 'RL'. That said, I would put an index on your transaction table on (type, dt). For your customer table, I would use a covering index on (uid, act) so the engine does not need to read the customer row data to retrieve the account: it qualifies the join on the uid, and the account # comes along for the ride.
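As hypothetical DDL against the question's table names (index names are placeholders):

```sql
-- Narrow by transaction type first, then range-scan the dates.
CREATE INDEX ix_trans_type_dt ON __transaction (type, dt);

-- Covering index: join on uid and read act straight from the index.
CREATE INDEX ix_cust_uid_act ON __customer (uid, act);
```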

DRapp
  • Also, I have added composite indexes (type, dt) on transaction and (uid, act) on customer, but the explain output still shows the same plan. – smzsmzsmz Aug 10 '17 at 18:10
  • I used a single day just to test the performance, but each bucket's date range can change dynamically, depending on the two arguments, 'date' and 'offset' – smzsmzsmz Aug 10 '17 at 23:04