2

I have a table called 'items' that looks something like this...

id | name
––––––––––––
1  | APPLES 
2  | BANANAS
3  | ORANGES
4  | PEARS

... and a junction table called 'pairs', creating many-to-many relationships between the items...

id | item1_id | item2_id
––––––––––––––––––––––––
1  | 1        |  2 
2  | 1        |  4
3  | 2        |  3
4  | 2        |  4
5  | 4        |  3

I have the following query to find items which are paired with a given item...

SELECT * FROM items i
WHERE
  i.id IN (SELECT item1_id FROM pairs WHERE item2_id = 4)
OR
  i.id IN (SELECT item2_id FROM pairs WHERE item1_id = 4)

Returning something like...

id | name
––––––––––––
1  | APPLES 
3  | ORANGES

...which does the job, however, it runs pretty slowly (with a small test dataset of approximately 100 items, 1000 pairings it's already taking about 75ms).

My question is – can this be optimised further to speed it up (e.g. using joins rather than nested queries)?

Thanks for any help.

teadog
  • 50
  • 7

2 Answers2

1

I think it will be sufficient to have indexes on pairs(item2_id, item1_id) and pairs(item1_id, item2_id) -- two separate indexes.

However, MySQL is sometimes funky about optimizing IN with subqueries. I would write this using exists:

SELECT i.*
FROM items i
WHERE EXISTS (SELECT 1
              FROM pairs p
              WHERE p.item2_id = 4 AND p.item1_id = i.id
             ) OR
      EXISTS (SELECT 1
              FROM pairs p
              WHERE p.item1_id = 4 AND p.item2_id = i.id
             );

These are guaranteed to use the indexes.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

The internal query optimizer does a great job at creating an execution plan, although you can look at the plan and identify bottlenecks. Things like expressing the same query in a different way generally don't make a huge difference at the end of the day. Even queries that are really crazy looking, you'd be surprised at how well the optimizer handles them and how two seemingly different expressions of the same query ultimately lead to the same thing. Changing that to use joins instead will probably lead to the same or similar execution plan.

So what I would do first is to create an index on your item1_id column, and a separate index on your item2_id column. This will help improve performance of those where clauses. Then, if that still doesn't meet your requirements, have a look at the Optimization chapter in the MySQL docs (for whichever version of MySQL you are using) for a full run-down of possible strategies. Note that it will benefit you to avoid heavy optimizations prematurely, especially if your application is complex. Once your application is in a mostly working state, you'll be in a better position to identify and address bottlenecks. But indices are always an easy and worthwhile first step at any development stage.

Jason C
  • 38,729
  • 14
  • 126
  • 182
  • That made a big difference. Execution time went from ~74ms down to ~3ms! I've never used indexes before, and helpful to know that MySQL optimises differently expressed queries similarly. Thanks! :) – teadog Feb 11 '17 at 18:07
  • 1
    @teadog Just don't fall into the trap of indexing *everything*, create indexes [where they help](http://stackoverflow.com/questions/5446124/mysql-why-not-index-every-field). PS With a larger dataset, you may also want to experiment with a single multi-column index on (item1_id,item2_id) vs. the two single-column indices I described here (I wouldn't do all three of those indices at the same time though unless there's a real need for it). – Jason C Feb 11 '17 at 19:11