2

Reading this wiki article, I found out that the SELECT performance is killed if using IN() clauses with indexed columns in a MySQL database. My question is, how can I rewrite my query so that it won't use any IN() clause while still keeping its functionality?

My query is:

SELECT 
    `Route`.`route_id`, `Route`.`order`, `Route2`.`order` 
FROM 
    `routes` AS `Route` 
INNER JOIN 
    `routes` AS `Route2` 
ON `Route`.`route_id` = `Route2`.`route_id` 
WHERE 
    `Route`.`station_line_id` IN ([10 values]) AND 
    `Route2`.`station_line_id` IN ([10 values]) AND 
    `Route`.`order` <= `Route2`.`order` 
GROUP BY `
    `Route`.`station_line_id`, `Route2`.`station_line_id`, (`Route2`.`order` - `Route`.`order`)

and I have indexed all columns (route_id, station_line_id, station_id and line_id), with the id column being the primary key (the table is just read-only once generated, so no worries for indexing everything). The [10 values] in the IN() clause are comma separated, like: IN(1, 2, ..., 10).

Basically, I self join the table routes table and group the results to get the desired records. The other joins are used for retrieving associated data.

Performance-wise, using InnoDB storage engine, I execute a similar query in >30seconds. Using MyISAM, I get >5seconds. But I believe results can be fetched even faster. I have ~4.5 million records in the table.

linkyndy
  • 17,038
  • 20
  • 114
  • 194
  • care to format your query a little bit? – juergen d Apr 13 '12 at 15:17
  • I've edited my question, sorry. – linkyndy Apr 13 '12 at 15:20
  • Is that 10 values: `IN (1,3,47, ...89)` or `IN (SELECT column FROM table)` ? – ypercubeᵀᴹ Apr 13 '12 at 15:25
  • The values are comma separated, hence `IN(1, 2, 3, ..., 10)`. – linkyndy Apr 13 '12 at 15:26
  • MySQL. I've edited my question to include all these details. – linkyndy Apr 13 '12 at 15:28
  • See this: http://stackoverflow.com/questions/782915/mysql-or-vs-in-performance . Unless you can replace the condition with something like BETWEEN or a range operation, you won't do better than IN(). – siride Apr 13 '12 at 15:32
  • Also this: http://www.mysqldiary.com/optimizing-the-mysql-in-comparison-operations-which-include-the-indexed-field/ – siride Apr 13 '12 at 15:34
  • Do you have an index on order? That may help as well as it's part of the `WHERE` clause. – Jeff Allen Apr 13 '12 at 15:34
  • Unfortunately, those 10 values are almost random, can't use BETWEEN. – linkyndy Apr 13 '12 at 15:34
  • What is the Primary Key of table `routes`? Is it `route_id` ? – ypercubeᵀᴹ Apr 13 '12 at 15:35
  • It's `id`, I've mentioned it in my question. `route_id` is used to group several records within the `routes` table. – linkyndy Apr 13 '12 at 15:38
  • I've added an index on the `order` column. Just a slight improvement in speed. – linkyndy Apr 13 '12 at 15:40
  • Another thing is that you group by `station_line_id` and yet you are showing a lot of other columns in the `SELECT` list. This is not standard SQL and it will show (more or less) random rows from the group. – ypercubeᵀᴹ Apr 13 '12 at 15:40
  • I've tweaked a little bit my query so it doesn't show more columns than it should. – linkyndy Apr 13 '12 at 15:48
  • I don't see a magic bullet here, but if you could dump the table structure (`SHOW CREATE TABLE Route`), that may provide a little more specific information regarding how everything's setup. – Jeff Allen Apr 13 '12 at 15:56
  • Unfortunately, this query is too complex to guess how MySQL will execute it. Please provide relevant schema and EXPLAIN result, so we can see what it's doing now. – Marcus Adams Apr 13 '12 at 16:08
  • I've figured it out. I was using a PHP framework (CakePHP) which was doing some unnecessary stuff that was taking so much time. Now queries execute ~1 second, which is more than decent. Thank you for your help, I will use your tips in the future! – linkyndy Apr 13 '12 at 16:31
  • 1
    It's GTFS ? Try using binary field for everything who's hex data. – David Bélanger Apr 13 '12 at 16:33

1 Answers1

1

You'll get the best performance in a query like this using a 'Hash index'. The 'standard' index is a B+ tree, which allows you to lookup entries in log(n) time where n is the number of rows in the table. They also maintain sorted order, so you can efficiently do queries like ... WHERE station_line_id > 14, so that's what you'll want to use on your Order column.

In your case, however, with an IN clause, you're only looking for equivalence. In that case, a B+ tree is going to have to lookup all m of your "[10 values]" separately, costing you m * log(n) time, which is apparently taking 5-30 seconds.

A hash index is used to lookup equivalent entries in a constant amount of time (very fast) which doesn't depend (theoretically) on the number of rows in your table -- it will always be very fast even on large tables. The downside of a hash index is that you can't use it to do queries like < or >, but it's the fastest at equivalence queries like the ones you're doing in your IN clause in station_line_id.

Edit: For MySQL specifically, they unfortunately don't support HASH indexes on any of their popular database engines. If you're able to use the MEMORY or HEAP engine, then you could use a HASH index -- and having everything in memory will likely improve performance quite a bit anyways. Worth a shot.

Jeff Allen
  • 17,277
  • 8
  • 49
  • 70
  • I am currently on a shared host and I believe storing such amounts of data in memory is not an option (or?). – linkyndy Apr 13 '12 at 15:37
  • 1
    MyISAM and InnoDB do not have Hash indexes. – ypercubeᵀᴹ Apr 13 '12 at 15:37
  • May still be worth a shot, depending on your data structure. I've got a 45 million row table which fits in 2.7GB of data and 1.1GB of index. At that rate, your table could take up as little as about .27 + .11 GB <= 400MB of memory. I don't know what your server requirements are, but 512MB isn't an unreasonable amount of memory for a VPS to have. Not sure whether or not that's an option for you, but I can guarantee a performance speedup. – Jeff Allen Apr 13 '12 at 15:46
  • I don't have any VPS, I'm just on a shared server, unfortunately. With time/memory limits which are not configurable. – linkyndy Apr 13 '12 at 15:50