
Here's a query I've written:

SELECT DISTINCT m.*, sm.*, s.*
FROM table_1 m
LEFT JOIN table_2 sm ON m.master_id = sm.master_id
INNER JOIN (
    SELECT s1.*, rn.field1, d.field2, m.field3, dt.field4, gl.field5
    FROM table_3 s1
        LEFT JOIN table_4 rn ON s1.secondary_id = rn.secondary_id
        LEFT JOIN table_5 d  ON s1.trait_id = d.trait_id
        LEFT JOIN table_6 m  ON s1.mix_id = m.mix_id
        LEFT JOIN table_7 dt ON s1.debit_id = dt.debit_id
        LEFT JOIN table_8 gl ON s1.graph_id = gl.graph_id
    WHERE s1.secondary_id = 26
        AND s1.dimension_id = 24
        AND s1.mix_id = 43
) s ON sm.spec_id = s.spec_id
WHERE m.master_id = 1

I'm testing it against VERY small tables (only ~3,000-5,000 records per table) and getting acceptable results (4.8 ms), from within a VM on my laptop.

My question/concern is what about when the tables get larger, so that the entire DB can't reside in memory?

Obviously, all of the *_id columns have indexes on them (whether as the primary key column or as a foreign key in the s1 table). I also have a multi-column index on (s1.secondary_id, s1.dimension_id, s1.mix_id).
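
For reference, the composite index was created roughly like this (the index name is just something I picked for illustration):

CREATE INDEX idx_s1_secondary_dimension_mix
    ON table_3 (secondary_id, dimension_id, mix_id);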

Is this adequate, or would anyone more versed than me know whether I should employ different indexes, or use a different strategy altogether, for this sort of query (a query against table_1, with a join to table_2, and another join against a subquery)? It's the last part that's giving me pause about scalability.

Any thoughts are appreciated.

PS - EXPLAIN states I might be using a temporary table (I assume from the join), but beyond that I'm not sure what I'm looking at, with the types being const, ref, eq_ref, etc.

Again, it works fine right now with a very small amount of test data. I just don't want it to grind to a halt once it gets production levels of data.

Thanks

Lucas Krupinski

3 Answers


Comments:

  • Subqueries in the from clause in MySQL can kill performance, because they are generally materialized.
  • The inner join to the subquery undoes the left join, because the ON clause requires that the keys be non-NULL (a flattened version addressing both of these points is sketched after this list).
  • You should compare numeric columns to numbers not strings. You probably don't want the single quotes.
  • An index on table_3(secondary_id, dimension_id, mix_id) is definitely called for (other columns may be useful after those).
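
As a rough illustration of the first two points, the query from the question could be flattened so there is no derived table to materialize. This is only a sketch: the DISTINCT is omitted, table_6 is re-aliased to mx so it doesn't collide with the outer m, and it assumes the filters behave the same when applied directly in the outer query.

SELECT m.*, sm.*, s1.*, rn.field1, d.field2, mx.field3, dt.field4, gl.field5
FROM table_1 m
JOIN table_2 sm ON m.master_id = sm.master_id   -- effectively inner already, since s required sm.spec_id to be non-NULL
JOIN table_3 s1 ON sm.spec_id = s1.spec_id
LEFT JOIN table_4 rn ON s1.secondary_id = rn.secondary_id
LEFT JOIN table_5 d  ON s1.trait_id = d.trait_id
LEFT JOIN table_6 mx ON s1.mix_id = mx.mix_id
LEFT JOIN table_7 dt ON s1.debit_id = dt.debit_id
LEFT JOIN table_8 gl ON s1.graph_id = gl.graph_id
WHERE m.master_id = 1
  AND s1.secondary_id = 26
  AND s1.dimension_id = 24
  AND s1.mix_id = 43;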
Gordon Linoff
  • You want indexes on the WHERE clause columns. – duffymo Aug 24 '17 at 01:10
  • Hi Gordon, in regards to your thoughts: are Views any better performance-wise than subqueries? Or the same? I'm not worried if it takes up more storage, just looking for performance. Also, the single quotes are not actually in the real query in real life; I wrote a PHP handler that replaces PDO placeholders with the data in the query, just to be able to see the query that's being executed, more or less. – Lucas Krupinski Aug 24 '17 at 01:15
  • Also, there is a 4-column index, with those 3 columns indexed, plus another one that can sometimes be used as search criteria. Sometimes columns 1, 2, 3, 4 will be queried, other times 2, 3, 4, other times 1, 3, 4, other times 1, 2, and so on. Is one multicolumn index like that appropriate? Or should I create indexes for each combination that could be included in the final query? – Lucas Krupinski Aug 24 '17 at 01:17
  • Equality conditions need to match the leftmost keys of the index (generally), so one index can handle multiple conditions. From what you say, you need multiple indexes (see the sketch after these comments). As for your other question, a view would substitute the results in-line without materializing a temporary table, so it might actually be faster. – Gordon Linoff Aug 24 '17 at 01:56
  • @GordonLinoff - "subqueries are generally materialized" -- I think this is too strong a statement. Furthermore, the Optimizer is getting smarter about not doing that, such as turning `IN`, which used to be notorious, into `EXISTs`. – Rick James Aug 24 '17 at 13:43
  • @RickJames . . . I changed that to "subqueries in the `from` clause". – Gordon Linoff Aug 27 '17 at 21:11
  • @GordonLinoff - While I agree in general, there are exceptions. [_Example_](https://stackoverflow.com/questions/35897180/why-does-mysql-not-use-optimal-indexes/35905072#35905072) . General case: when the derived table has `GROUP BY` or `LIMIT` to significantly decrease the number of rows. – Rick James Aug 27 '17 at 21:43
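
To illustrate the leftmost-prefix point with a made-up four-column index (the table and column names here are placeholders, not ones from the question):

CREATE INDEX idx_1234 ON some_table (col1, col2, col3, col4);

-- Queries whose equality conditions cover a leftmost prefix can use idx_1234:
--   WHERE col1 = ? AND col2 = ? AND col3 = ? AND col4 = ?
--   WHERE col1 = ? AND col2 = ? AND col3 = ?
--   WHERE col1 = ? AND col2 = ?
-- A query on (col2, col3, col4) skips col1, so it would want its own index:
CREATE INDEX idx_234 ON some_table (col2, col3, col4);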

As you mentioned, your query will work fine with a small number of records. But once your DB starts getting millions (or hundreds of thousands) of records, your query will slow down.

You can take precautions in the beginning so that your query will not slow down:

  1. Break the query into multiple queries: this way your database will also not run out of memory.

  2. Check the query execution plan to see if your query is always using the indexes. It will also tell you how much work is going into the various join operations, and the steps the DB is taking to execute your query (see the sketch after this list).

  3. As data is fed into the tables, the table and index space keeps getting fragmented. Do routine de-fragmentation.

  4. One thing is for sure: relational databases are not designed for horizontal scaling, but they scale very well vertically, so installing additional hardware (extra memory, CPU) also helps.
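
For point 2, in MySQL you just prefix the statement with EXPLAIN; a minimal sketch against a cut-down version of the query from the question:

EXPLAIN
SELECT m.*, sm.*
FROM table_1 m
LEFT JOIN table_2 sm ON m.master_id = sm.master_id
WHERE m.master_id = 1;

-- MySQL 5.6+ can also produce a more detailed, structured plan:
EXPLAIN FORMAT=JSON
SELECT m.*, sm.*
FROM table_1 m
LEFT JOIN table_2 sm ON m.master_id = sm.master_id
WHERE m.master_id = 1;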

Geek
  • As a "general guideline" is a table with <10,000 records considered small enough to avoid indexes? – LWSChad Jun 21 '22 at 11:45
  • Don't use LEFT unless the righthand table is really optional.
  • Do have a 3-column 'composite' index on s1.
  • Do provide EXPLAIN SELECT to allow us to better answer your Question. We need to see more than just that it used a temp table. Maybe the Explain would provide enough info to answer your scalability question.
  • Do not break a query into multiple queries -- unless you are really smarter than the Optimizer.
  • Do not use OPTIMIZE TABLE; it is not worth the effort on InnoDB tables. BTrees are self-defragmenting (to some degree).
  • MySQL does not make use of multiple CPUs in a single connection.
  • More RAM helps in some use cases.
  • DISTINCT (usually) requires a temp table and an extra pass over the data. Since you are doing SELECT *, the only way you would need to de-dup is if you actually have dups in the tables. That seems unlikely.
  • varchar = numeric-constant is the only combination that has trouble. The commonly used int = 'number' optimizes OK.
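
A small illustration of that last point, using a made-up table (not one from the question):

CREATE TABLE t (
    int_col INT,
    vc_col  VARCHAR(20),
    KEY (int_col),
    KEY (vc_col)
);

SELECT * FROM t WHERE int_col = '5';  -- OK: the string constant is converted once; the index on int_col is usable
SELECT * FROM t WHERE vc_col = 5;     -- Trouble: vc_col values are compared as numbers, so the index on vc_col can't be used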

Explains

  • Note the "Rows" column in the EXPLAINs. They are all small numbers.
  • Note 'PRIMARY' and 'eq_ref'. Those are (usually) very good things to have.
  • So, tentatively, I predict that this query will still run fast, even when the tables are too big to be cached. This is because only a few rows in each table (only 1 row in some of the tables) will need to be touched. No apparent table scans or other thing that would be very costly if too big.
Rick James
  • OK, I've converted all the LEFT joins to INNER joins. Here is a screen cap of the query and the output from EXPLAIN (http://imgur.com/a/yqkmv). And yes, it looks like it works fine without the DISTINCT. I was getting apparent duplicate records when I wrote the query without the ending WHERE, so I added DISTINCT and never got rid of it. And without DISTINCT, it does not create the temporary table. LINK: http://imgur.com/a/gXU1Y – Lucas Krupinski Aug 24 '17 at 18:45