MySQL; LEFT OUTER joining on DATEDIFF with indexes taking 300+ seconds (but only the first time)

Question

MySQL's built in cache really makes this question moot for most of the day, but the very first time the following query is run, the performance is terrible: Taking over 300 seconds the first time whereas subsequent querying can complete in milliseconds. Running this with SQL_NO_CACHE then takes 2-4 seconds (!) which is very acceptable in this instance -- but the initial run-time is not.

SELECT DATEDIFF( bt.`datetime`, st.`datetime`) AS 'day_separation'
FROM `smallerTable` AS st
LEFT OUTER JOIN `bigTable` AS bt ON bt.item_id = st.item_id
  AND bt.code = 'X'
  AND bt.`datetime` > st.`datetime`
  AND DATEDIFF ( bt.datetime, st.datetime) < 11
  AND st.`datetime` > '2012-07-01' AND st.`datetime` < 'yesterdays-date 23:59:59'

I have introduced multi-column indexes (thanks to this question) but it still could not address this particular problem. This solution looks inspired but I don't think it is applicable since I'm not sure how I could union these results.

The smaller table has ~8000 records and I want to count / include all of them right now. It will eventually grow bigger and contain items prior to 2012-07-01.

bigTable has 10 million records and I only want to match the "pairing" of those records to the smaller table. Part of the trouble is that they cannot share a direct key or reference linking them together so I am left with a LEFT OUTER JOIN and guessing that if the timestamp of the two events are < 11 days apart (and share the other conditions) that they they must be related.

Excluding the test DATEDIFF ( bt.datetime, st.datetime) < 11 created 14k 'results' illustrating that the number of DATEDIFF calculations that "need to occur" is 14k-8k (a.k.a. 6k).

INDEXES: the datetime fields of each table, the code, and the item_ids.

I have compound indexes on both tables in the order of (item_id, datetime). From my understanding, that is the necessary order because we use the datetime fields in the select statement in the form of DATEDIFF( bt.datetime, st.datetime).

Would a combined index on (code, item_id, datetime) revolutionize this query? (Yes it did!)

The explain reveals little to my untrained eye other than that it is using a temporary table which I understand can be time-consuming.

id * select_type * table * type  * possible_keys * key                * key_len * ref           * rows * extra
1  * SIMPLE      * st    * index * NULL          * items_for_datetime * 59      * NULL          * 8295 * using index; using temporary; using filesort
1  * SIMPLE      * BT    * ref   * [many]        * items_for_datetime * 51      * master.st.item_id * 3    *

Depending on MySQL's whims, bigTable sometimes shows that it prefers the item_id key over items_for_datetime. Should I encourage the use of my joint index believing that I know better?

Some extra info:

The inserts into these tables occur once each day (1~5k records into BT)
No updates or deletions ever occur
I could probably run two queries -- Change this one to INNER JOIN and then run a second one to subtract the number of results from the total records to find the number that didn't have a corresponding result in BT
We have already executed phpmyadmin's Check Table, Defragmentation, and Optimize Table on BT

[aside] Could this be a good scenario for using a NoSQL database such as Mongo?

Why is there be such a disparity on the first run and the second? More importantly: What can be done to improve the timing of the first run?

Update: New attempts require a new day to find out their efficacy. Tomorrow I will attempt Barmar's suggestion using BETWEEN and DATE_ADD. I have also created a combined index on (code, item_id, datetime). I will report back tomorrow the result but welcome any other ideas.

Update: Success! The first run of the query now only spent 6 seconds which is amazing considering where it came from. Subsequent querying took only .035 seconds! What a dream. The combined index on (code, item_id, datetime) no doubt had a hand in this success. Here is the new query: Thanks everyone!

SELECT DATEDIFF( bt.`datetime`, st.`datetime` ) AS  'day_separation'
FROM  `smallerTable` AS st
LEFT OUTER JOIN bigTable AS bt USE INDEX (  `cmd_item_time` ) 
ON bt.item_id = st.item_id
  AND bt.code =  'X'
  AND bt.`datetime` BETWEEN st.`datetime` AND DATE_ADD( st.`datetime`, INTERVAL 10 DAY ) 
  AND st.datetime BETWEEN '2012-07-01' AND  'yesterdays-date 23:59:59'

Indexes on the datetime fields aren't going to help, because you're doing your decision logic based on the result of a function call, not on the underlying fields. If it were possible, you'd have to put an index on `datediff(bt.datetime, st.datetime)`. But you cannot index "derived" results, especially like this, because mysql would have to pregenerate the diff results for EVERY bt/st combination in the table. — Marc B, Apr 15 '13 at 14:46
Interesting comment, @MarcB; and would it benefit me to personally generate every bt/st combination -- assuming all the other factors are in agreement? There is _very little_ duplication of `item_id`s as indicated by there only being 6k extra items when I removed the `DATEDIFF` conditional — veeTrain, Apr 15 '13 at 14:53

Barmar · Accepted Answer · 2013-04-15T15:43:57.153

1

Try changing:

AND bt.`datetime` > st.`datetime`
AND DATEDIFF ( bt.datetime, st.datetime) < 11

to:

AND bt.`datetime` BETWEEN st.`datetime` AND date_add(st.`datetime`, interval 11 day)

This may allow the index on bt.datetime to be used.

If code = 'X' filters out a large portion of bigTable, a compound index on (code, item_id) should help.

edited Apr 15 '13 at 15:43

answered Apr 15 '13 at 14:58

Barmar

741,623
53
500
612

Thank you Barmar; I will definitely try it out! That looks really cool/creative. Do you have any suggestions about making an 'uber' combined index? – veeTrain Apr 15 '13 at 14:59
Barmar, the `EXPLAIN` just told me it plans on using the `item_id` index. Since I only get "one go" of this query per day, would you suggest I just try it or try to force a different index? And, do you think [both of these](http://stackoverflow.com/questions/1128736/mysql-date-add/1128742#1128742) would use the `datetime` index equally as well? – veeTrain Apr 15 '13 at 15:16
I guess that means that it can't use the `datetime` index for this type of comparison. – Barmar Apr 15 '13 at 15:22
Aye; when I tried to `USE INDEX (\`datetime\`)` it replied that it would have to do a full table scan and could not apparently use it...does that mean that my combined indexes that include `datetime` will also fail? Should I just leave them out? – veeTrain Apr 15 '13 at 15:30
I think that confirms that it's not able to use the datetime index. See my suggestion of a compound `(code, item_id)` index. – Barmar Apr 15 '13 at 15:44
Congratulations, Barmar; because of your assistance and my research the troublesome query just took 6 seconds to execute - in total! An amazing improvement. The query you mention does allow for indexing on the datetimes as mentioned [here](https://www.informit.com/articles/article.aspx?p=377652&seqNum=2) which I 'accidentally' found yesterday. Search for DATE_ADD. My new index of `(code, item_id, datetime)` is successful and is being chosen by MySQL even without my query hint. Thanks so much! – veeTrain Apr 16 '13 at 12:41

score 0 · Answer 2 · answered Apr 15 '13 at 14:46

0

The problem of you query is most probably that line : AND st.datetime > '2012-07-01' AND st.datetime < 'yesterdays-date 23:59:59'

By casting datetime to string (in order to compare) , you're loosing the advantages of indexes...

answered Apr 15 '13 at 14:46

Laurent S.

6,816
2
28
40

wrong. `'2012-07-01'` will be converted to native mysql datetime. it won't be datetime being convered to string. – Marc B Apr 15 '13 at 14:47
`yesterdays-date` just means for today it would execute as `2013-04-14` – veeTrain Apr 15 '13 at 14:47
I think putting quotes around the dates is necessary and acceptable. [This answer](http://stackoverflow.com/a/95256/469643) is similar and might be something I try – veeTrain Apr 15 '13 at 14:58

MySQL; LEFT OUTER joining on DATEDIFF with indexes taking 300+ seconds (but only the first time)

2 Answers2