
I need an SQL guru to help me speed up my query.

I have 2 tables, quantities and prices. quantities records a quantity value between 2 timestamps, 15 minutes apart. prices records a price for a given timestamp and a given price type, and there is a price record for every 5 minutes.

I need to work out the total price for each period, e.g. hour or day, between two timestamps. This is calculated as the sum, over each period, of the quantity multiplied by the average of the 3 prices in that quantity's 15-minute window.

For example, let's say I want to see the total price each hour for 1 day. The total price value in each row in the result set is the sum of the total prices for each of the four 15 minute periods in that hour. And the total price for each 15 minute period is calculated by multiplying the quantity value in that period by the average of the 3 prices (one for each 5 minutes) in that quantity's period.
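
For instance (made-up numbers, just to illustrate): if the 00:00-00:15 quantity is 2.5 and the three 5-minute prices in that window are 0.10, 0.12 and 0.14, that period contributes 2.5 * AVG(0.10, 0.12, 0.14) = 2.5 * 0.12 = 0.30, and the 00:00-01:00 row is the sum of the four such 15-minute contributions.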

Here's the query I'm using, and the results:

SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

+---------------------+---------------------+----------+
| start               | end                 | total    |
+---------------------+---------------------+----------+
| 2010-07-01 00:00:00 | 2010-07-01 01:00:00 | 0.677733 |
| 2010-07-01 01:00:00 | 2010-07-01 02:00:00 | 0.749133 |
| 2010-07-01 02:00:00 | 2010-07-01 03:00:00 | 0.835467 |
| 2010-07-01 03:00:00 | 2010-07-01 04:00:00 | 0.692233 |
| 2010-07-01 04:00:00 | 2010-07-01 05:00:00 | 0.389533 |
| 2010-07-01 05:00:00 | 2010-07-01 06:00:00 | 0.335300 |
| 2010-07-01 06:00:00 | 2010-07-01 07:00:00 | 1.231467 |
| 2010-07-01 07:00:00 | 2010-07-01 08:00:00 | 0.352800 |
| 2010-07-01 08:00:00 | 2010-07-01 09:00:00 | 1.447200 |
| 2010-07-01 09:00:00 | 2010-07-01 10:00:00 | 0.756733 |
| 2010-07-01 10:00:00 | 2010-07-01 11:00:00 | 0.599467 |
| 2010-07-01 11:00:00 | 2010-07-01 12:00:00 | 1.056467 |
| 2010-07-01 12:00:00 | 2010-07-01 13:00:00 | 1.252600 |
| 2010-07-01 13:00:00 | 2010-07-01 14:00:00 | 1.285567 |
| 2010-07-01 14:00:00 | 2010-07-01 15:00:00 | 0.442933 |
| 2010-07-01 15:00:00 | 2010-07-01 16:00:00 | 0.692567 |
| 2010-07-01 16:00:00 | 2010-07-01 17:00:00 | 1.281067 |
| 2010-07-01 17:00:00 | 2010-07-01 18:00:00 | 0.652033 |
| 2010-07-01 18:00:00 | 2010-07-01 19:00:00 | 1.721900 |
| 2010-07-01 19:00:00 | 2010-07-01 20:00:00 | 1.362400 |
| 2010-07-01 20:00:00 | 2010-07-01 21:00:00 | 1.099300 |
| 2010-07-01 21:00:00 | 2010-07-01 22:00:00 | 0.646267 |
| 2010-07-01 22:00:00 | 2010-07-01 23:00:00 | 0.873100 |
| 2010-07-01 23:00:00 | 2010-07-02 00:00:00 | 0.546533 |
+---------------------+---------------------+----------+
24 rows in set (5.16 sec)

I need the query to run a lot faster than this, and I would have thought it would be possible. Here are the results from EXPLAIN EXTENDED ...

+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| id | select_type        | table      | type  | possible_keys     | key             | key_len | ref   | rows  | Extra                                        |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
|  1 | PRIMARY            | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 | Using where                                  |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
2 rows in set, 3 warnings (0.00 sec)

I noticed the dependent subquery is not using the timestamp field in the key, and the query is scanning loads of rows.
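
(I did wonder whether a composite index covering both columns the subquery filters on might help, e.g. something like the statement below, but I haven't verified that it changes the plan.)

ALTER TABLE `prices` ADD INDEX `type_id_timestamp` (`type_id`, `timestamp`);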

Can anyone help me get this running a hell of a lot faster?

Here are the SQL statements required to create the schema and fill it with a lot of data (2 months' worth).

# Create prices table

CREATE TABLE `prices` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `timestamp` datetime NOT NULL,
  `type_id` int(11) NOT NULL,
  `price` float(8,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `timestamp` (`timestamp`),
  KEY `type_id` (`type_id`)
) ENGINE=MyISAM;

# Create quantities table

CREATE TABLE `quantities` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `start_timestamp` datetime NOT NULL,
  `end_timestamp` datetime NOT NULL,
  `quantity` float(7,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_timestamp` (`start_timestamp`),
  KEY `end_timestamp` (`end_timestamp`)
) ENGINE=MyISAM;

# Insert first 2 rows into prices, one for each of 2 types, starting 64 days ago

INSERT INTO `prices` (`id`, `timestamp`, `type_id`, `price`) VALUES
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '1', RAND()),
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '2', RAND());

# Fill the prices table with a record for each type, for every 5 minutes, for the next 64 days
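# (Each INSERT below doubles the row count by copying the existing rows shifted forward by a successively
# smaller interval; the final INSERT back-fills the 5-minute slots that the doubling sequence misses.)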

INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 32 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 16 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 8 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 4 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 2 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 1 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 12 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 6 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 3 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 90 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 45 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 20 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 10 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_SUB(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices WHERE MOD( (TIME_TO_SEC( `timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 *60 ) = 0 AND `timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');

# Insert first row into quantities, start timestamp is 64 days ago, end timestamp is start timestamp plus 15 minutes

INSERT INTO `quantities` (`id`, `start_timestamp`, `end_timestamp`, `quantity`) VALUES (NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), DATE_SUB(CURDATE(), INTERVAL '63 23:45' DAY_MINUTE), RAND());

# Fill the quantities table with a record for each 15 minute period for the next 64 days

INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 32 DAY), DATE_ADD(`end_timestamp`, INTERVAL 32 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 16 DAY), DATE_ADD(`end_timestamp`, INTERVAL 16 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 8 DAY), DATE_ADD(`end_timestamp`, INTERVAL 8 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 4 DAY), DATE_ADD(`end_timestamp`, INTERVAL 4 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 2 DAY), DATE_ADD(`end_timestamp`, INTERVAL 2 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 1 DAY), DATE_ADD(`end_timestamp`, INTERVAL 1 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 12 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 12 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 6 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 6 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 3 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 3 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 90 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 90 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 45 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 45 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 15 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities;
INSERT INTO quantities (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_SUB(`start_timestamp`, INTERVAL 15 MINUTE), DATE_SUB(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities WHERE MOD( (TIME_TO_SEC( `start_timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 * 60 ) = 0 AND `start_timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
neilcrookes
  • The only way to improve performance is to rewrite the subquery as a JOIN. – Naktibalda Jul 22 '10 at 10:39
  • I think that's what JochenJung has recommended below too, but we are having trouble getting it to generate the correct results and it's still taking the same amount of time? Do you have any ideas Naktibalda? – neilcrookes Jul 22 '10 at 10:55
  • @Naktibalda, this is not true; we are talking MySQL here - in certain cases, such as the above, you will have to help the query planner by deliberately making subqueries (not correlated... I'll give an answer) – Unreason Jul 22 '10 at 11:14

3 Answers


Here is my first attempt. This one is dirty and relies on the following properties of the data:

  • there are three 5-minute prices for each quarter-hour in quantities (if the data violates this, the query will not work)
  • note the "for each" and the cardinality of three: neither is guaranteed by data integrity checks, which is why I call it dirty
  • it is also not flexible to changes in the period lengths

Query 1:

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum((p1.price + p2.price + p3.price)/3*q.quantity) as total 
FROM 
    quantities q join 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join 
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join 
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

This one returns results in 0.01 sec on my slow testing machine, where the original query runs in ~6 sec and gnarf's query in ~0.85 sec (all queries were always tested with the SQL_NO_CACHE keyword, which prevents reusing cached results, but on a warm database).

EDIT: Here is a version that is not sensitive to missing rows on the price side (Query 1a):

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
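    -- the denominator counts how many of the three prices were actually found:
    -- each COALESCE(pN.price - pN.price, 1) term is 0 when pN.price is non-NULL
    -- and 1 when the LEFT JOIN found no row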
    sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / ( 
         3 -
         COALESCE(p1.price-p1.price,1) - 
         COALESCE(p2.price-p2.price,1) - 
         COALESCE(p3.price-p3.price,1)
        )
       *q.quantity) as total 
FROM 
    quantities q LEFT JOIN 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

EDIT 2: Query 2: Here is a direct improvement to your query, and a different approach, with minimal changes (the constant date range is repeated inside the correlated subquery, this time on prices.timestamp), which brings the execution time down to ~0.22 sec on my machine:

SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE 
    `prices`.`timestamp` >= '2010-07-01 00:00:00' 
    AND `prices`.`timestamp` < '2010-07-02 00:00:00' 
    AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
    AND `prices`.`timestamp` < `quantities`.`end_timestamp`
    AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

That is on MySQL 5.1; I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could relate your start_timestamp and timestamp columns through a foreign key, that should allow this kind of correlated query to make use of indexes (but for that you would need to modify the design and establish some sort of timeline table, which both quantities and prices could then reference).
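
Just to illustrate what I mean by a timeline table, here is a minimal sketch (the table and column names are placeholders I made up, and enforcing the foreign keys would also mean switching the tables to InnoDB and populating timeline with every 5-minute timestamp first):

CREATE TABLE `timeline` (
  `ts` datetime NOT NULL,
  PRIMARY KEY (`ts`)
) ENGINE=InnoDB;

-- prices.timestamp and quantities.start_timestamp / end_timestamp would then reference it
ALTER TABLE `prices` ENGINE=InnoDB;
ALTER TABLE `prices`
  ADD CONSTRAINT `fk_prices_timeline`
  FOREIGN KEY (`timestamp`) REFERENCES `timeline` (`ts`);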

Query 3: Finally, here is the last version, which does it in ~0.03 sec but should be as robust and flexible as Query 2:

SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM 
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND 
                        p.timestamp < q.end_timestamp AND
                        p.timestamp >= '2010-07-01 00:00:00' AND 
                        p.`timestamp` < '2010-07-02 00:00:00' 
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00' 
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );

NOTE: Do not forget to remove sql_no_cache keywords in production.

There are many counter-intuitive tricks applied in the above queries (sometimes repeating conditions in the join condition speeds a query up, sometimes it slows it down). MySQL is a great little RDBMS, and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into scenarios like the above.

So in general, I apply the following principle to set my expectations regarding the performance of a query:

  • if the base result set has < 1,000 rows, then the query should do its business in ~0.01 sec (the base result set being the number of rows that functionally determines the resulting set)

In this particular case you start with fewer than 1,000 rows (all the prices and quantities in one day, at 15-minute precision), and from that you should be able to compute the final results.

Unreason
  • you are a legend, thanks very much. Query 2 returns perfect results in 0.0039 sec and Query 3 also returns perfect results in 0.1655 sec – neilcrookes Jul 22 '10 at 14:27
  • @neilcrookes, You are welcome. Can you confirm that Query 2 runs faster than Query 3 on your machine? (Initially there was an unmarked Query 1a, which I have properly marked now. Also, you should allow the DB to warm up its indexes; I usually run queries with `sql_no_cache` a few times when benchmarking.) – Unreason Jul 22 '10 at 14:33
  • (could no longer edit first comment, so creating a new one), you are a legend, thanks very much. Query 1a returns perfect results in 0.0039 sec and Query 2 also returns perfect results in 0.1655 sec. Query 3 suffers the same issue as @gnarf's query in that it doesn't return a row where there are no prices in that hour and the start and end times correspond to the earliest and latest price records in that hour, but returns in 0.0144 sec. Query 1a is the winner. Thanks again. You're a life saver. – neilcrookes Jul 22 '10 at 14:48
  • @neilcrookes, just for completeness' sake - I have a bug in Query 3: the `LEFT JOIN` serves no purpose if the query later tests for `p.type_id = 1`, because that filters out the rows with NULLs (that's why it drops those rows). However, correcting this error by putting in `(p.type_id = 1 OR p.type_id IS NULL)` slows the query down to ~0.2 sec. Will not edit the answer. – Unreason Jul 22 '10 at 15:30

This should return the same results and perform slightly faster:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  SUM( `quantities`.`quantity` * `prices`.`price` ) 
   * COUNT(DISTINCT `quantities`.`id`) 
   / COUNT(DISTINCT `prices`.`id`)
    AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

Since you can't calculate AVG() inside the SUM(), I had to do some interesting COUNT(DISTINCT) work to account for the number of prices returned per quantity. I'm wondering if this gives you the same results with "real" data...
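
Here is why this can match the original when the data is complete (a quick sanity check with made-up numbers, not the real data): with 4 quantity rows and 3 prices each per hour, every quantity contributes quantity * (sum of its 3 prices) to the SUM, i.e. 3 * quantity * AVG, and multiplying by COUNT(DISTINCT quantities.id) / COUNT(DISTINCT prices.id) = 4 / 12 divides that factor of 3 back out. For two quarters only:

SUM(quantity * price)           = 2.0*(0.10+0.12+0.14) + 1.0*(0.20+0.20+0.20) = 1.32
* COUNT(DISTINCT quantities.id) = 1.32 * 2 = 2.64
/ COUNT(DISTINCT prices.id)     = 2.64 / 6 = 0.44
original query                  = 2.0*AVG(0.10,0.12,0.14) + 1.0*AVG(0.20,0.20,0.20) = 0.24 + 0.20 = 0.44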

Using JOIN:

+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys                 | key             | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8       | NULL |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ALL   | timestamp,type_id             | NULL            | NULL    | NULL | 36862 |    62.20 | Using where; Using join buffer               |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+

vs. the same query only adding LEFT to the JOIN

+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys     | key             | key_len | ref   | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 |   100.00 |                                              |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+

Interesting that adding LEFT completely removes end_timestamp as a possible key and changes the selected keys so much, making the query take 15 times as long...

This reference page could help you out a little more if you want to look at specifying index hints for your JOINs.
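
For example, here is the same query with explicit index hints - this is just to show the syntax; whether forcing these particular indexes actually helps here is untested:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  SUM( `quantities`.`quantity` * `prices`.`price` )
   * COUNT(DISTINCT `quantities`.`id`)
   / COUNT(DISTINCT `prices`.`id`)
    AS total
FROM `quantities` FORCE INDEX (`start_timestamp`)
JOIN `prices` FORCE INDEX (`timestamp`)
  ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );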

gnarf
  • +1 This is good, also adding composite indexes on (start_timestamp, end_timestamp) and on (type_id, timestamp) should help. However, I think I'll be able to bring it down to ~0.01 sec – Unreason Jul 22 '10 at 11:17
  • @Unreason --- *scratches head* you say +1, but no one has voted yet ;) --- I'm interested in seeing how you get it down that far! – gnarf Jul 22 '10 at 11:27
  • Thanks gnarf, this is almost spot on. It is running in about 0.4 secs on my machine, but the results are different to my original query. The reason, I think, is that you are dividing by COUNT(`prices`.`price`), which with the GROUP clause and this data will be 4 quantity rows * 3 price rows = 12, but if you divide by 3 then it generates the same results as my original query. Trouble is, I don't want to hard-code 3 in the query, but I can't figure out what the SQL is to derive that value from the data. Once that part is sorted, it'll all be perfect. Any ideas much appreciated? – neilcrookes Jul 22 '10 at 11:43
  • @neilcrookes - The query is now giving me the same results with the test data... But that doesn't mean it's the same calculation... In fact, if you delete only one price from that time range, the calculation returns different results from your query. This answer doesn't quite work... – gnarf Jul 22 '10 at 12:18
  • @gnarf, using your latest query I tested it with deleting a whole hour's worth of consecutive prices, 45 minutes' worth, 30 minutes' worth, 15 minutes' worth and just 5 minutes' worth, and I get almost the same results. The differences were: in my original I do get a row for the hour with no prices whatsoever, with a total of NULL, but I don't get a row from your query. Also, in your query the MIN and MAX start and end timestamps are ones where I still have prices. However, most totals match, apart from the hour where I deleted only one of the prices - in this case there was a very small discrepancy. – neilcrookes Jul 22 '10 at 13:06
  • I tried your query with a LEFT JOIN on the dataset with the prices missing as described in the comment above, and the results had the NULL row where I deleted all prices in the hour, and the start and end columns were the same as my query, but the totals for the hours where I had deleted prices were not the same, and it took 5 seconds! :-( But I'm wondering whether it could be made accurate by adjusting the COUNT(DISTINCT `quantities`.`id`) & COUNT(DISTINCT `prices`.`id`) clauses to exclude NULL rows or something? And sped up somehow? – neilcrookes Jul 22 '10 at 13:13
  • @neil, ah, you can have nulls... I just posted the answer that runs fast but will have to do LEFT JOINs and COALESCE to compensate for the NULLs. Also, one possible problem with neil's query is that COUNT(DISTINCT ...) will fail on prices which remain the same, as those will be counted only once. I will also, time permitting, write a more general query, still aiming for high speed. – Unreason Jul 22 '10 at 13:24
  • @gnarf, check my answer, as promised ~0.01 sec; also some notes on performance expectancy. – Unreason Jul 22 '10 at 14:36

Remember, just because you have indexes on your columns doesn't necessarily mean your queries will run faster. As it stands, each index created covers an individual column, which would return results quite quickly if you were only limiting the data on that one column.

So, to try and avoid "Using filesort" (which you want to do as much as possible), maybe try the following index:

CREATE INDEX start_timestamp_end_timestamp_id ON quantities (start_timestamp,end_timestamp,id);

And something similar for the prices table (combining the 3 individual indexes you have into 1 index for faster lookups).
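
For example (this is just a guess at the column order; you would generally put the column you filter on by equality first):

CREATE INDEX type_id_timestamp_id ON prices (type_id,timestamp,id);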

An excellent resource which explains indexing in great detail, how to optimize your indexes, what the different EXPLAIN outputs mean, and what to aim for is: http://hackmysql.com/case1

AcidRaZor
  • Thanks AcidRaZor, but adding this index and one on the prices table didn't do much to improve performance on either my original query or the one that @gnarf has suggested. – neilcrookes Jul 22 '10 at 12:04
  • It was worth a shot :) However, I'd still recommend reading through the website I quoted as well. It goes into more detail on how you could enhance the performance of your queries. – AcidRaZor Jul 22 '10 at 12:06