
I am running a filtered aggregate query and would like some feedback on how to improve its response time.

The Query (runs, but averages 400+ seconds):

select data_date,sum(closeprice) from moving_avgs
where
    symbol in (select distinct symbol from moving_avgs
                where
                ma200_close     >= 5.00 and
                ma200_volume    >= 400000 and
                data_date   = (select min(data_date) from moving_avgs
                                where year(data_date) = 2007) 
                )
group by data_date;

My EXPLAIN output reads (formatted to read in this environment):

id:         1
select_type:    PRIMARY
table:      moving_avgs
type:       ALL
possible_keys:  NULL
key:        NULL
key_len:        NULL
ref:        NULL
rows:       6250033
Extra:      Using where; Using temporary; Using filesort

id:         2
select_type:    DEPENDENT SUBQUERY
table:      moving_avgs
type:       unique_subquery
possible_keys:  PRIMARY,symbol,data_date,ma200_close,ma200_volume
key:        PRIMARY
key_len:        29
ref:        func,const
rows:       1
Extra:      Using where

id:         3
select_type:    SUBQUERY
table:      moving_avgs
type:       index
possible_keys:  NULL
key:        data_date
key_len:        3
ref:        NULL
rows:       6250033
Extra:      Using where; Using index

My my.ini [mysqld] and [myisamchk] sections read as follows (running on a 4 GB dual-processor AMD laptop):

[mysqld]
port        = 3306
socket      = /tmp/mysql.sock
skip-external-locking
key_buffer_size = 512M
max_allowed_packet = 20M
table_open_cache = 256
sort_buffer_size = 8M
read_buffer_size = 8M
read_rnd_buffer_size = 16M
myisam_sort_buffer_size = 256M
thread_cache_size = 8
query_cache_size= 132M
basedir=c:/wamp/bin/mysql/mysql5.5.24
log-error=c:/wamp/logs/mysql.log
datadir=c:/wamp/bin/mysql/mysql5.5.24/data
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 8

[myisamchk]
key_buffer_size = 256M
sort_buffer_size = 256M
read_buffer = 4M
write_buffer = 4M

Thanks!

  • **You need to show us the table and index definitions.** Diagnosing slow queries requires full table and index definitions, not just a description or paraphrase. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. – Andy Lester Dec 22 '12 at 05:23

4 Answers


Could you post the result of a SHOW CREATE TABLE?

Also, could you try this variant and see how long it takes:

SELECT
    data_date,
    SUM(closeprice)
FROM moving_avgs
INNER JOIN
(
    SELECT DISTINCT symbol
    FROM moving_avgs
    WHERE
        ma200_close  >= 5.00 AND
        ma200_volume >= 400000 AND
        data_date =
        (
            SELECT MIN(data_date)
            FROM moving_avgs
            WHERE YEAR(data_date) = 2007
        )
) symbols ON symbols.symbol = moving_avgs.symbol
GROUP BY data_date;

I suspect three sources of slowness, in combination or separately. The reasoning behind the first two is pretty straightforward:

(1) The indexes on your table may not be designed as well as they could be; I don't see good index usage in your EXPLAIN output.

(2) The way the subquery is written in the WHERE clause may be preventing the engine from using the index you have on symbol, losing the performance that index might otherwise give you. The EXPLAIN output suggests this is happening.

(3) Another way of stating (2), without talking about the role of indexes: the engine may be handling the main subquery (the one in the WHERE) inefficiently because it incorrectly deduces a relationship to the outer query; that is, it treats your query as a correlated subquery and makes a bad plan based on that assumed relationship.

[Note: as your WHERE is written, the subquery is NOT correlated, so it could be executed efficiently and the IN resolved relatively cheaply (though possibly without the benefit of an index). The engine may simply be misjudging this; you do have a somewhat complicated nested subquery that could lead it astray.]

In any case, moving the subquery into a join should fix such a scenario, because it removes any possibility of the engine inefficiently correlating the subquery with the rest of your query. When the subquery is the source of a join, the engine has to resolve it before the rest of the query body is considered, which eliminates any bad deductions it might be making about a relationship between the two.
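If you want to make that materialization explicit, here is a sketch of the same idea using a hypothetical staging table: resolve the symbol list once into a temporary table, then join against it.

-- Materialize the qualifying symbols once, up front.
CREATE TEMPORARY TABLE qualifying_symbols AS
    SELECT DISTINCT symbol
    FROM moving_avgs
    WHERE ma200_close  >= 5.00
      AND ma200_volume >= 400000
      AND data_date = (SELECT MIN(data_date) FROM moving_avgs
                       WHERE YEAR(data_date) = 2007);

-- The aggregate then joins against the small, pre-resolved list.
SELECT ma.data_date, SUM(ma.closeprice)
FROM moving_avgs ma
INNER JOIN qualifying_symbols qs ON qs.symbol = ma.symbol
GROUP BY ma.data_date;

DROP TEMPORARY TABLE qualifying_symbols;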

DWright
  • Here are my CREATE TABLE keys (I too have focused on my indexes). Cannot fit the entire CREATE TABLE DDL... `PRIMARY KEY (symbol,data_date), KEY symbol (symbol), INDEX data_date (data_date), INDEX closeprice (closeprice), INDEX ma200_close (ma200_close), INDEX ma200_volume (ma200_volume)` – Dr.EMG Dec 22 '12 at 04:55
  • And how did that join-based variant perform? – DWright Dec 22 '12 at 05:05
  • The result set is now returned in less than 70 seconds. I have no idea whether that is good for a 6.2-million-row table. Any other suggestions on indexes would be greatly appreciated. – Dr.EMG Dec 22 '12 at 05:21

I suspect that this condition:

(select min(data_date) from moving_avgs
 where year(data_date) = 2007)

is going to be expensive, because it has to calculate the year for every row, and it won't be able to use any index that might be on data_date (which we don't know, because you haven't yet shown us the table and index definitions).

If there is an index on data_date, then you can let MySQL use it by changing that to

(select min(data_date) from moving_avgs
 where data_date between '2007-01-01' and '2007-12-31')

(MySQL date literals are written 'YYYY-MM-DD'.) You're giving it a start point and an end point, which lets it use the index. That's not possible when you're asking it to calculate the year on every row.
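A quick way to verify the difference, assuming an index on data_date exists, is to EXPLAIN both forms of the subquery and compare:

-- Function applied to the column: the index on data_date cannot be
-- used as a range, so expect a scan over all rows.
EXPLAIN SELECT min(data_date) FROM moving_avgs
WHERE year(data_date) = 2007;

-- Bare column compared to a range: eligible for the data_date index,
-- and MIN() can often be answered from the index alone.
EXPLAIN SELECT min(data_date) FROM moving_avgs
WHERE data_date BETWEEN '2007-01-01' AND '2007-12-31';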

Andy Lester
  • 1
    Andy, this makes a lot of sense. I will try this too. – Dr.EMG Dec 22 '12 at 05:31
  • 1
    By using your suggestion along with the other changes mentioned above, the query result set now comes back in a little more than 30 secs. Thanks! – Dr.EMG Dec 22 '12 at 05:42
  • 1
    @Andy. Excellent. I missed the significance of the year(data_date). – DWright Dec 22 '12 at 05:47
  • 1
    @DWright: It's a very common problem in SQL questions posted to SO. See [this question](http://stackoverflow.com/questions/13997176/getting-mysql-result-from-the-last-30-days/13997463#13997463) from earlier today. I've been bitten so many times in my own code by not properly seeing conditions as ranges that checking for it is now just reflex. – Andy Lester Dec 22 '12 at 05:55

By:

1) creating and setting two my.ini variables:

max_heap_table_size = 256M

tmp_table_size = 512M

2) increasing a third variable:

myisam_sort_buffer_size = 256M

3) and deleting three single-field indexes, replacing them with one four-field index, INDEX (data_date, ma200_close, ma200_volume, symbol),

I was able to get the time down to 178 seconds.

Along with 4) restructuring the query (with thanks to @DWright), it is now down to 67 seconds.
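For reference, step 3 as DDL (a sketch only; the single-column index names are taken from my CREATE TABLE above, so verify them with SHOW CREATE TABLE before running):

-- Replace three single-column indexes with one composite index.
ALTER TABLE moving_avgs
    DROP INDEX data_date,
    DROP INDEX ma200_close,
    DROP INDEX ma200_volume,
    ADD INDEX idx_date_close_vol_sym (data_date, ma200_close, ma200_volume, symbol);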

Dr.EMG

One option I see is to pre-compute min(data_date) for each year. That way you don't have to fire a SELECT query for each and every record of the outer query. But then you will need to maintain this table to ensure it always holds the minimum data_date for a given year at any point in time.
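A minimal sketch of that idea (table and column names here are hypothetical):

-- One row per year; must be refreshed when new data is loaded so the
-- stored minimum stays correct.
CREATE TABLE yearly_min_date (
    data_year     SMALLINT NOT NULL PRIMARY KEY,
    min_data_date DATE     NOT NULL
);

INSERT INTO yearly_min_date (data_year, min_data_date)
SELECT YEAR(data_date), MIN(data_date)
FROM moving_avgs
GROUP BY YEAR(data_date);

The innermost subquery then becomes SELECT min_data_date FROM yearly_min_date WHERE data_year = 2007, a single primary-key lookup.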

Vaibhav Desai
  • I was thinking about breaking the data_date field out into its own component fields (i.e., data_year, data_month) at a minimum. I do not query a specific date often but might need to in the future. What are your thoughts on this? – Dr.EMG Dec 22 '12 at 05:26
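For what it's worth, a sketch of that approach (column and index names are hypothetical; MySQL 5.5 has no generated columns, so the load process must keep the values in sync):

-- Redundant year/month columns, backfilled from data_date and indexed
-- so that filters like data_year = 2007 become plain index lookups.
ALTER TABLE moving_avgs
    ADD COLUMN data_year  SMALLINT NOT NULL DEFAULT 0,
    ADD COLUMN data_month TINYINT  NOT NULL DEFAULT 0,
    ADD INDEX idx_data_year (data_year);

UPDATE moving_avgs
SET data_year  = YEAR(data_date),
    data_month = MONTH(data_date);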