MySQL's built in cache really makes this question moot for most of the day, but the very first time the following query is run, the performance is terrible: Taking over 300 seconds the first time whereas subsequent querying can complete in milliseconds. Running this with SQL_NO_CACHE then takes 2-4 seconds (!) which is very acceptable in this instance -- but the initial run-time is not.
SELECT DATEDIFF( bt.`datetime`, st.`datetime`) AS 'day_separation'
FROM `smallerTable` AS st
LEFT OUTER JOIN `bigTable` AS bt ON bt.item_id = st.item_id
AND bt.code = 'X'
AND bt.`datetime` > st.`datetime`
AND DATEDIFF ( bt.datetime, st.datetime) < 11
AND st.`datetime` > '2012-07-01' AND st.`datetime` < 'yesterdays-date 23:59:59'
I have introduced multi-column indexes (thanks to this question) but it still could not address this particular problem. This solution looks inspired but I don't think it is applicable since I'm not sure how I could union these results.
The smaller table has ~8000 records and I want to count / include all of them right now. It will eventually grow bigger and contain items prior to 2012-07-01.
bigTable has 10 million records and I only want to match the "pairing" of those records to the smaller table. Part of the trouble is that they cannot share a direct key or reference linking them together so I am left with a LEFT OUTER JOIN
and guessing that if the timestamp of the two events are < 11 days apart (and share the other conditions) that they they must be related.
Excluding the test DATEDIFF ( bt.datetime, st.datetime) < 11
created 14k 'results' illustrating that the number of DATEDIFF
calculations that "need to occur" is 14k-8k (a.k.a. 6k).
INDEXES: the datetime
fields of each table, the code
, and the item_id
s.
I have compound indexes on both tables in the order of (item_id, datetime)
. From my understanding, that is the necessary order because we use the datetime fields in the select statement in the form of DATEDIFF( bt.datetime, st.datetime)
.
Would a combined index on (code, item_id, datetime)
revolutionize this query? (Yes it did!)
The explain reveals little to my untrained eye other than that it is using a temporary table which I understand can be time-consuming.
id * select_type * table * type * possible_keys * key * key_len * ref * rows * extra
1 * SIMPLE * st * index * NULL * items_for_datetime * 59 * NULL * 8295 * using index; using temporary; using filesort
1 * SIMPLE * BT * ref * [many] * items_for_datetime * 51 * master.st.item_id * 3 *
Depending on MySQL's whims, bigTable sometimes shows that it prefers the item_id
key over items_for_datetime
. Should I encourage the use of my joint index believing that I know better?
Some extra info:
- The inserts into these tables occur once each day (1~5k records into BT)
- No updates or deletions ever occur
- I could probably run two queries -- Change this one to INNER JOIN and then run a second one to subtract the number of results from the total records to find the number that didn't have a corresponding result in BT
- We have already executed phpmyadmin's
Check Table
,Defragmentation
, andOptimize Table
on BT
[aside] Could this be a good scenario for using a NoSQL database such as Mongo?
Why is there be such a disparity on the first run and the second? More importantly: What can be done to improve the timing of the first run?
Update: New attempts require a new day to find out their efficacy. Tomorrow I will attempt Barmar's suggestion using BETWEEN
and DATE_ADD
. I have also created a combined index on (code, item_id, datetime)
. I will report back tomorrow the result but welcome any other ideas.
Update: Success! The first run of the query now only spent 6 seconds which is amazing considering where it came from. Subsequent querying took only .035 seconds! What a dream. The combined index on (code, item_id, datetime)
no doubt had a hand in this success. Here is the new query: Thanks everyone!
SELECT DATEDIFF( bt.`datetime`, st.`datetime` ) AS 'day_separation'
FROM `smallerTable` AS st
LEFT OUTER JOIN bigTable AS bt USE INDEX ( `cmd_item_time` )
ON bt.item_id = st.item_id
AND bt.code = 'X'
AND bt.`datetime` BETWEEN st.`datetime` AND DATE_ADD( st.`datetime`, INTERVAL 10 DAY )
AND st.datetime BETWEEN '2012-07-01' AND 'yesterdays-date 23:59:59'