
I've been trying to get some speed improvements on a certain SELECT query. The situation is as follows: there is a (in my eyes) huge crossing table. It currently has about 20 million rows, but I expect this to grow a lot. Based on this crossing table I need to create another table. To do so, I need to execute the following query:

SELECT hugeCrossingTable.field3, otherTable.field1, hugeCrossingTable.field2 * otherTable.field3 AS someName
FROM hugeCrossingTable 
INNER JOIN otherTable ON hugeCrossingTable.field1 = otherTable.field2

Now this currently results in about a million rows. I already have indexes on field1 in both tables, but the query still takes 18 minutes to finish. I thought about splitting the table, but then I'd need to find a way to split the data, and since it's just a crossing table, nothing comes to mind on how to do this.

Any ideas on how this can be optimized?

Thanks.

On request here's the create statement:

CREATE TABLE  `hugeCrossingTable` (
  `field` int(11) NOT NULL,
  `field1` int(11) NOT NULL,
  `field2` double(10,5) DEFAULT NULL,
  `field3` int(4) DEFAULT NULL,
  KEY `field1` (`field1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE  `otherTable` (
  `field` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `field1` int(10) unsigned NOT NULL,
  `field2` int(10) unsigned NOT NULL,
  `field3` decimal(5,2) NOT NULL,
  PRIMARY KEY (`field`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

And the explain output:

id, select_type, table              , type , possible_keys, key     , key_len, ref                , rows, Extra
1 , 'SIMPLE'   , 'otherTable'       , 'ALL', ''           , ''      , ''     , ''                 , 294 , ''
1 , 'SIMPLE'   , 'hugeCrossingTable', 'ref', 'field1'     , 'field1', '4'    , 'otherTable.field2', 69  , 'Using where'
lordstyx
  • Show us the `CREATE TABLE` statements, and the `EXPLAIN` output of the query. If the indexes are correct, but you have very specific ranges, `PARTITIONING` the table might help. – Wrikken Apr 05 '11 at 20:10
  • JOIN is the devil for large tables. You pretty much only want to access a table that size by an index or the primary key. Can you change the table format to suit the query, or even use a temp memory table? This can take you from minutes to ms in query time. Remember, normalization does not imply speed. – Joseph Lust Apr 05 '11 at 20:26
  • You're loading ALL the records?? Why?? Archiving data/records is not a bad idea (if you're sure you don't need it again!) – Rudie Apr 05 '11 at 20:28
  • so you're creating a new table based on a subset of rows from another table e.g insert into target_table (x,y) select x,y from source table ? – Jon Black Apr 05 '11 at 20:30
  • @Rudie : It's all live data really. I run this query every 30 minutes. @f00 : I first take the return set through some other queries to see if they match what is needed at that moment, and then I place the result in another table. @Twisted Pear : Any suggestions on how to do this? @Wrikken : like I said, I wanted to partition the table, but since it's a crossing table I don't know how! – lordstyx Apr 05 '11 at 20:31

2 Answers


Here are some InnoDB examples that work on large tables of approx. 60 to 500 million rows, demonstrating the advantages of a well designed InnoDB table and how best to use clustered indexes (only available with InnoDB):

MySQL and NoSQL: Help me to choose the right one

60 million entries, select entries from a certain month. How to optimize database?

Rewriting mysql select to reduce time and writing tmp to disk

You will also want to read the following:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
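
As a concrete illustration of the clustered-index idea applied to the tables in the question, the DDL might look something like this (a sketch only: it assumes (field1, field3) is unique in hugeCrossingTable and that field3 contains no NULLs, since primary key columns must be NOT NULL; the index name on otherTable is made up):

ALTER TABLE hugeCrossingTable
  DROP KEY field1,                  -- redundant once the primary key leads with field1
  ADD PRIMARY KEY (field1, field3); -- clustered index: rows are stored in join-key order

ALTER TABLE otherTable
  ADD KEY field1_field2 (field1, field2); -- extend with field3 if you want it covering for this query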

Once you've sorted out your table designs and optimised your InnoDB config:

http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

http://www.mysqlperformanceblog.com/2007/11/03/choosing-innodb_buffer_pool_size/

You can try something like:

start transaction;

insert into target_table (x,y) select x,y from source_table order by x,y;

commit;
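
Note the ORDER BY: if (x,y) matches the target table's primary key, the rows arrive in clustered-index order, so InnoDB appends them sequentially instead of inserting into random pages, which is considerably faster for a bulk load.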

Hope this helps.

Jon Black
  • Wow, this really looks promising! Thanks a lot already. Updating the indexes of the 20 mil table is going to take a while, I'll get back to this once it's done. – lordstyx Apr 05 '11 at 21:07
  • Probably quicker to SELECT INTO OUTFILE in primary key order, then import back using LOAD DATA INFILE - 10 mins vs... hours :P – Jon Black Apr 05 '11 at 21:29
  • Alright, so I removed KEY `field1` (`field1`) from hugeCrossingTable and added PRIMARY KEY (`field1`,`field3`), and to otherTable I added KEY `theKey` (`field1`,`field2`), which brought the time of the query down to 4 seconds. (Maybe it would have been faster to do it that way, but I did it overnight so that's no problem!) – lordstyx Apr 06 '11 at 11:32
  • Oh, that should be 4 milliseconds. Sorry! – lordstyx Apr 06 '11 at 14:07
  • This optimization is great for reading, but what about writing? I have a table about a quarter of the size of this one (but growing, of course) that's written to quite a lot, but also read from quite frequently (writes > reads). It takes about 30 seconds to read from, and the writing isn't fast either. Any ideas on how to get the time down on both queries? Adding more indexes would only make the reading faster. – lordstyx Apr 06 '11 at 18:04
  • What if your engine is MyISAM? – haneulkim Jan 14 '20 at 02:09
hugeCrossingTable:  INDEX(field1,  field3, field2)
otherTable:         INDEX(field2,  field1, field3)

and drop the current KEY `field1` (`field1`), as it becomes redundant.

Both of those are "composite" and "covering".

MySQL will pick one of the tables to start with, then reach into the other one for the other columns needed. Without a WHERE clause, the Optimizer will probably pick the 'smaller' table. I provide indexes for either eventuality.

"Covering" is indicated in EXPLAIN SELECT ... as "Using index". It speeds things up by looking only in the Index's BTree, without needing to reach from there into the data's BTree.

With InnoDB, you really should include an explicit PRIMARY KEY.
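
In DDL terms that might look like the following (a sketch: the index names are invented, and using `field` as the explicit primary key is an assumption; pick whatever uniquely identifies a row in your data):

ALTER TABLE hugeCrossingTable
  ADD PRIMARY KEY (field),                       -- explicit PK for InnoDB; assumes `field` is unique
  ADD INDEX hct_cover (field1, field3, field2),  -- composite and covering for this query
  DROP KEY field1;                               -- redundant, as noted above

ALTER TABLE otherTable
  ADD INDEX ot_cover (field2, field1, field3);   -- composite and covering for this query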

The "All" indicates that it decided to scan the entire table. With "covering", it will, instead, scan the entire Index. This will be only slightly faster.

Be sure to run any timing test twice -- the first may be I/O-bound and the second would simply use the data already in cache (the buffer_pool). Or, if you have the "Query cache" turned on, the second call would not even perform the query; instead it would deliver the previous result. (I bring this up because 4ms is not possible for "millions" of rows.)

PARTITIONing is not a performance panacea. It would, if anything, slow down this query some.

A MEMORY table is no faster than already having the data cached in the buffer_pool.

What is the value of innodb_buffer_pool_size? How much RAM do you have? (Maybe you are doing I/O because of a too-small cache?)
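
For example, you can check the current setting with:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

On a dedicated database server, the usual recommendation (see the buffer pool link in the other answer) is to give it the bulk of available RAM, often around 70%.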

MyISAM is going away; don't bother with it.

Rick James