4

We use composite indexes in many places and they work great. Our platform has a relatively complex "report builder" that dynamically writes queries based on what the users select. We choose indexes based on common queries and often our composite indexes line up well with what users want.

For instance (simplifying a ton), a user's report will turn into the query:

SELECT
  colA,
  colB,
  SUM(col1),
  SUM(col2)
FROM
  someTable
WHERE
  colA = "foo"
  AND colB = "bar"
  AND colC = "baz"
GROUP BY 1, 2;

And the table will look something like:

CREATE TABLE someTable (
    id PRIMARY KEY,
    col1 data_type,
    col2 data_type,
    colA data_type,
    colB data_type,
    colC data_type,
    INDEX someIndex (colA,colB,colC)
);

I know MySQL won't be able to use that index on a query like:

SELECT
  colA,
  colB,
  SUM(col1),
  SUM(col2)
FROM
  someTable
WHERE
  colB = "foo"
GROUP BY 1, 2;

My question is - will MySQL use the composite index with a query like:

SELECT
  colA,
  colB,
  SUM(col1),
  SUM(col2)
FROM
  someTable
WHERE
  colA != ""
  AND colB = "foo"
GROUP BY 1, 2;

I know, of course, that we could just add an index on colB, but that's not a good option in this case. I've simplified here to make the question clear. Realistically, we have like 100 columns in this table and can't add indexes on each column and possible column variation, so I'm trying to figure out if we can make some use of what we already have to make things a bit faster for those edge-case queries.

Hope I worded that well. Thanks in advance :)

2 Answers2

1

My question is - will MySQL use the composite index with a query like: ...

No, it won't. Not equals (!=) match is not indexable, so the index would have to start with colB for the index to be usable. If you change the index from (colA, colB, colC) to (colB, colA, colC), that index would be usable for both the first query you mention and the one you are asking about.

Gordan Bobić
  • 1,748
  • 13
  • 16
  • What if it wasn't `!=`? For instance, imagining colA was an integer column that we used as an ID, would `colA > 0` work? – yyyyyyyyyyyyyyyyy Jun 08 '20 at 00:13
  • 1
    You should have `=` matches first, followed by a range match. Anything after that won't help, except as a covering index to avoid the tablespace row access. – Gordan Bobić Jun 08 '20 at 00:17
  • I think `!=` acts like a "range", where it hops over the value being avoided. – Rick James Jun 08 '20 at 01:29
  • @RickJames, I don't think it does, because just specifying a single `foo != 1` won't use an index on `(foo)` at all, but `foo > 1` will use an index. – Gordan Bobić Jun 08 '20 at 06:57
  • 1
    @GordanBobic - Let's discuss further; can you propose a test case that would highlight the point? – Rick James Jun 08 '20 at 18:51
  • @RickJames `EXPLAIN SELECT * FROM foo WHERE bar != 1;` I would expect that to not use index(bar) even if one is present. – Gordan Bobić Jun 08 '20 at 18:54
  • @GordanBobic - That one is tricky to test. _If_ it goes by this rule, it would not use the index: "If more than X% of the index will be hit, don't use the index". (where X ~= 20%). – Rick James Jun 08 '20 at 22:36
  • @RickJames that's just it, if you turn `!=` into a `<` range and a `>` range, you'll hit 50% of the index below and 50% of the index above, unless the field has cardinality so low that the index is useless anyway. – Gordan Bobić Jun 08 '20 at 23:04
  • @GordanBobic - Or 90% of the table is 0. If the Optimizer does the probes, it can get something better than the crude "cardinality". (I haven't read the code; I am only analyzing as a "user".) – Rick James Jun 08 '20 at 23:20
  • @RickJames - I think we are getting lost in the nuances of improbability. For all intents and purposes it is usually safe to assume that `!=` is going to be slow. – Gordan Bobić Jun 09 '20 at 00:35
1
SELECT SUM(..), SUM(..)
    ...
    GROUP BY 1,2

does not make sense. Generally, one GROUPs BY the scalar column(s), not the aggregates. Remove the GROUP BY, you will probably get the same answer, and get it a little faster.

When building an index, start with columns tested with = (or IS NULL).

Then move on to any IN(constant list), since it sometimes acts like =, sometimes acts like a range.

Then one "range" -- BETWEEN or any inequality

So, for

WHERE
  colA != ""
  AND colB = "foo"

colB should come first. Hence `INDEX(colB, colA) in that order works well for that query (and the one before it).

As was already noted, INDEX(colB, colA, colC) in that order will work well for all three queries, but may not work well for other queries.

The order of columns in the INDEX matters; the order in the WHERE does not. The cardinality of individual components of a composite index does not matter.

Longer discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Whoa! Rick James! Your website is where I learned half of what I know about SQL optimization . Grouping on sum was an error while writing the question. Fixed in edit! Your answer totally makes sense. I guess the underlying main question is, will MySQL be able to use a composite index at all if I "fake" one of the columns by setting it to `col > 0` or something if that WHERE is not actually important in the query – yyyyyyyyyyyyyyyyy Jun 08 '20 at 00:14
  • @yyyyyyyyyyyyyyyyy - `col > 0` looks (to the Optimizer) like a "range". An improved question: `col INT UNSIGNED NOT NULL` and `col >= 0`. Even for that, I am pretty sure the Optimizer is not smart enough to throw out the unnecessary `col >= 0`. – Rick James Jun 08 '20 at 01:26
  • @yyyyyyyyyyyyyyyyy - For your "100 columns", you might find some tips in http://mysql.rjweb.org/doc.php/eav – Rick James Jun 08 '20 at 01:30