0

MySQL's built in cache really makes this question moot for most of the day, but the very first time the following query is run, the performance is terrible: Taking over 300 seconds the first time whereas subsequent querying can complete in milliseconds. Running this with SQL_NO_CACHE then takes 2-4 seconds (!) which is very acceptable in this instance -- but the initial run-time is not.

SELECT DATEDIFF( bt.`datetime`, st.`datetime`) AS 'day_separation'
FROM `smallerTable` AS st
LEFT OUTER JOIN `bigTable` AS bt ON bt.item_id = st.item_id
  AND bt.code = 'X'
  AND bt.`datetime` > st.`datetime`
  AND DATEDIFF ( bt.datetime, st.datetime) < 11
  AND st.`datetime` > '2012-07-01' AND st.`datetime` < 'yesterdays-date 23:59:59'

I have introduced multi-column indexes (thanks to this question) but it still could not address this particular problem. This solution looks inspired but I don't think it is applicable since I'm not sure how I could union these results.

The smaller table has ~8000 records and I want to count / include all of them right now. It will eventually grow bigger and contain items prior to 2012-07-01.

bigTable has 10 million records and I only want to match the "pairing" of those records to the smaller table. Part of the trouble is that they cannot share a direct key or reference linking them together so I am left with a LEFT OUTER JOIN and guessing that if the timestamp of the two events are < 11 days apart (and share the other conditions) that they they must be related.

Excluding the test DATEDIFF ( bt.datetime, st.datetime) < 11 created 14k 'results' illustrating that the number of DATEDIFF calculations that "need to occur" is 14k-8k (a.k.a. 6k).

INDEXES: the datetime fields of each table, the code, and the item_ids.

I have compound indexes on both tables in the order of (item_id, datetime). From my understanding, that is the necessary order because we use the datetime fields in the select statement in the form of DATEDIFF( bt.datetime, st.datetime).

Would a combined index on (code, item_id, datetime) revolutionize this query? (Yes it did!)

The explain reveals little to my untrained eye other than that it is using a temporary table which I understand can be time-consuming.

id * select_type * table * type  * possible_keys * key                * key_len * ref           * rows * extra
1  * SIMPLE      * st    * index * NULL          * items_for_datetime * 59      * NULL          * 8295 * using index; using temporary; using filesort
1  * SIMPLE      * BT    * ref   * [many]        * items_for_datetime * 51      * master.st.item_id * 3    *

Depending on MySQL's whims, bigTable sometimes shows that it prefers the item_id key over items_for_datetime. Should I encourage the use of my joint index believing that I know better?

Some extra info:

  • The inserts into these tables occur once each day (1~5k records into BT)
  • No updates or deletions ever occur
  • I could probably run two queries -- Change this one to INNER JOIN and then run a second one to subtract the number of results from the total records to find the number that didn't have a corresponding result in BT
  • We have already executed phpmyadmin's Check Table, Defragmentation, and Optimize Table on BT

[aside] Could this be a good scenario for using a NoSQL database such as Mongo?

Why is there be such a disparity on the first run and the second? More importantly: What can be done to improve the timing of the first run?

Update: New attempts require a new day to find out their efficacy. Tomorrow I will attempt Barmar's suggestion using BETWEEN and DATE_ADD. I have also created a combined index on (code, item_id, datetime). I will report back tomorrow the result but welcome any other ideas.

Update: Success! The first run of the query now only spent 6 seconds which is amazing considering where it came from. Subsequent querying took only .035 seconds! What a dream. The combined index on (code, item_id, datetime) no doubt had a hand in this success. Here is the new query: Thanks everyone!

SELECT DATEDIFF( bt.`datetime`, st.`datetime` ) AS  'day_separation'
FROM  `smallerTable` AS st
LEFT OUTER JOIN bigTable AS bt USE INDEX (  `cmd_item_time` ) 
ON bt.item_id = st.item_id
  AND bt.code =  'X'
  AND bt.`datetime` BETWEEN st.`datetime` AND DATE_ADD( st.`datetime`, INTERVAL 10 DAY ) 
  AND st.datetime BETWEEN '2012-07-01' AND  'yesterdays-date 23:59:59'
Community
  • 1
  • 1
veeTrain
  • 2,915
  • 2
  • 26
  • 43
  • Indexes on the datetime fields aren't going to help, because you're doing your decision logic based on the result of a function call, not on the underlying fields. If it were possible, you'd have to put an index on `datediff(bt.datetime, st.datetime)`. But you cannot index "derived" results, especially like this, because mysql would have to pregenerate the diff results for EVERY bt/st combination in the table. – Marc B Apr 15 '13 at 14:46
  • Interesting comment, @MarcB; and would it benefit me to personally generate every bt/st combination -- assuming all the other factors are in agreement? There is _very little_ duplication of `item_id`s as indicated by there only being 6k extra items when I removed the `DATEDIFF` conditional – veeTrain Apr 15 '13 at 14:53

2 Answers2

1

Try changing:

AND bt.`datetime` > st.`datetime`
AND DATEDIFF ( bt.datetime, st.datetime) < 11

to:

AND bt.`datetime` BETWEEN st.`datetime` AND date_add(st.`datetime`, interval 11 day)

This may allow the index on bt.datetime to be used.

If code = 'X' filters out a large portion of bigTable, a compound index on (code, item_id) should help.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Thank you Barmar; I will definitely try it out! That looks really cool/creative. Do you have any suggestions about making an 'uber' combined index? – veeTrain Apr 15 '13 at 14:59
  • Barmar, the `EXPLAIN` just told me it plans on using the `item_id` index. Since I only get "one go" of this query per day, would you suggest I just try it or try to force a different index? And, do you think [both of these](http://stackoverflow.com/questions/1128736/mysql-date-add/1128742#1128742) would use the `datetime` index equally as well? – veeTrain Apr 15 '13 at 15:16
  • I guess that means that it can't use the `datetime` index for this type of comparison. – Barmar Apr 15 '13 at 15:22
  • Aye; when I tried to `USE INDEX (\`datetime\`)` it replied that it would have to do a full table scan and could not apparently use it...does that mean that my combined indexes that include `datetime` will also fail? Should I just leave them out? – veeTrain Apr 15 '13 at 15:30
  • I think that confirms that it's not able to use the datetime index. See my suggestion of a compound `(code, item_id)` index. – Barmar Apr 15 '13 at 15:44
  • Congratulations, Barmar; because of your assistance and my research the troublesome query just took 6 seconds to execute - in total! An amazing improvement. The query you mention does allow for indexing on the datetimes as mentioned [here](https://www.informit.com/articles/article.aspx?p=377652&seqNum=2) which I 'accidentally' found yesterday. Search for DATE_ADD. My new index of `(code, item_id, datetime)` is successful and is being chosen by MySQL even without my query hint. Thanks so much! – veeTrain Apr 16 '13 at 12:41
0

The problem of you query is most probably that line : AND st.datetime > '2012-07-01' AND st.datetime < 'yesterdays-date 23:59:59'

By casting datetime to string (in order to compare) , you're loosing the advantages of indexes...

Laurent S.
  • 6,816
  • 2
  • 28
  • 40
  • wrong. `'2012-07-01'` will be converted to native mysql datetime. it won't be datetime being convered to string. – Marc B Apr 15 '13 at 14:47
  • `yesterdays-date` just means for today it would execute as `2013-04-14` – veeTrain Apr 15 '13 at 14:47
  • I think putting quotes around the dates is necessary and acceptable. [This answer](http://stackoverflow.com/a/95256/469643) is similar and might be something I try – veeTrain Apr 15 '13 at 14:58