1

I got to a point were I can not understand why the following MySQL query gets slower when I use an index in my where clause. The column that makes me crazy is called deleted. The table contains 4.8M rows.

The Query:

SELECT SQL_NO_CACHE SUM(amount)/100 FROM transactions WHERE (type="Payment" or type="Refund") and deleted is NULL

That query takes slightly above 11 seconds when the column is an Index and 3 seconds when its not indexed or when I use USE INDEX() which tell the optimizer not to use any index.

MySQL version 5.6, tested in AWS Aurora db.r5.xlarge (4CPU/32GB)

Table Structure:

id int(11) NOT NULL, type enum('Charge','Payment','Refund','Credit Adjustment','Debit Adjustment','Transfer') NOT NULL, amount int(11) NOT NULL, deleted datetime DEFAULT NULL, deleted_by int(11) DEFAULT NULL ENGINE=InnoDB DEFAULT CHARSET=utf8; ADD KEY type (type), ADD KEY deleted (deleted)

I would appreciate any clues here!

4 Answers4

1

I used "explain" to check the above query if the index can be used or not. As my result, the index doesn't work for either "OR" operator or "IN", so I think "UNION" is better choice. And I think you don't need to add index for "deleted" column, because it doensn't work as well.

"explain" result for IN operator: "explain" result for IN operator

"explain" result for OR operator: "explain" result for OR operator

"union" result: "union" result

index on "deleted" column doesn't work: index on "deleted" column doesn't work

ascripter
  • 5,665
  • 12
  • 45
  • 68
momo
  • 141
  • 6
  • 1
    please, put the images directly, not external links. – tremendows Oct 09 '19 at 13:23
  • MySQL will skip using an index if the values you search for match a large portion of the table. The threshold is not documented, but in my experience it's approximately 20% of the table. Would it be reasonable to say each type value occurs in under 20% of the table, but together they match more than 20%? And I would assume `deleted is NULL` matches much more than 20% of the rows. – Bill Karwin Oct 09 '19 at 17:22
0

(Edit: Apparently this is wrong for this particular situation. This answer only applies if the OR'd conditions involved different fields....or create a range check that prevents taking advantage of fields farther into an index. See comments for details.)

MySQL does not take advantage of indexes very well when presented with OR conditions. Often you can speed up a query like

SELECT a FROM b WHERE y = n1 OR y = n2

by expanding it to a union like this

SELECT a FROM b WHERE y = n1
UNION 
SELECT a FROM b WHERE y = n2

I've heard more recent versions have made such conditions expressed in the form y IN (n1, n2) a little more efficient, but my primary work in the last few years has been in MS SQL, so I can't say how much it has improved.

This can even be used in the case of your straight forward summing with a little more expansion....

SELECT SUM(subt) 
FROM (
   SELECT SUM(amount)/100 AS subt FROM transactions WHERE type="Payment" and deleted is NULL
   UNION 
   SELECT SUM(amount)/100 AS subt FROM transactions WHERE type="Refund" and deleted is NULL
) AS subq
Uueerdo
  • 15,723
  • 1
  • 16
  • 21
  • 2
    MySQL can handle `OR` terms just fine -- if all the terms are searching the same column. It's equivalent to an `IN()` predicate. This isn't just recent versions. MySQL has used an index in this case for many years. The UNION workaround is needed only if the `OR` terms are each searching _different_ columns. See my old answer here: https://stackoverflow.com/a/13866221/20860 – Bill Karwin Oct 08 '19 at 22:08
  • I'll defer to your expertise; the few times I've had to do something like this it was indeed for different columns. But are we meaning the same thing by "recent"? The version the question refers to is 5.6, which was released over 6 years ago. I thought the optimizations to `IN` didn't come until 8.x. – Uueerdo Oct 08 '19 at 22:14
  • There may be _new_ optimizations in 8.x, but MySQL has certainly used indexes for range queries for many years. Here's doc for 5.5, circa 2010: https://dev.mysql.com/doc/refman/5.5/en/range-optimization.html That's not the oldest MySQL version that supported using indexes for range queries, it's just the oldest doc that's still online. – Bill Karwin Oct 08 '19 at 22:18
  • Ah, I was under the mistaken impression that OR conditions on the same field were still not interpreted as a "range condition". – Uueerdo Oct 08 '19 at 22:30
  • @BillKarwin momo's testing/answer here seems to suggest the IN/OR is indeed impairing MySQL's ability to use the index; perhaps indexes cannot be used for IN/OR when the `enum` data type is involved? – Uueerdo Oct 09 '19 at 16:44
  • It could also be caused by the values they are searching for. If the values occur in a large portion of the table, MySQL will just skip using the index. It's more efficient to do a table-scan if the values match a majority of the rows anyway, for the same reason that an index at the back of a book doesn't bother indexing words that are too common. They may get different EXPLAIN results with a FORCE INDEX hint, which tells MySQL to assume a table-scan is too expensive. – Bill Karwin Oct 09 '19 at 17:14
  • 1
    @Uueerdo - Correction to your Edit: There are other situations where switching to `UNION` is beneficial, even with the same columns involved. For example (I think): `WHERE x=1 AND y IN(2,3) AND z>4` with `INDEX(x,y,z)`. In this case, it gets to use `z`. – Rick James Oct 15 '19 at 04:27
  • @RickJames Good point, updated the little disclaimer at the top further; it seems the cases where converting an IN/OR to a UNION doesn't help are getting fewer and fewer. – Uueerdo Oct 15 '19 at 16:17
0

I think I came up with a logical idea why using an indexed column would cause a delay. The problem should be in the data of that column and especially in its highly malformed distribution of unique values - respectively binary three nodes. It consists of 4.8 M rows with the same NULL value and just 30 K rows with 3 K unique values.

  1. When deleted index is used to find the NULL values it does not have significant effect of reducing the subset of rows that MySQL will further process but adds very significant amount of overhead activity dealing with the binary tree index. I suspect that without the index summing operation is faster enough so that it outperforms, even making full table scan, the benefits of reduced subset of rows that the index can provide but at the cost of significant indexing overhead.

  2. The data in that deleted column pumps up deleted index cardinality and makes it preferable for the optimizer over the type column index which has cardinality just 10. If values distribution in both columns was normal then its logical to prioritize using the one with higher cardinality and result a smaller subset for further processing. However this deleted column values' distribution is very malformed toward the null values. In the same manner as described above using deleted index for finding null values adds a lot of overhead but does not do much for the performance, prevent using the other more relevant indexes and thus results delay.

0

If you remove the index on just deleted and add this "composite" index:

INDEX(deleted, type)   -- in this order

it may run faster. Note that the = column comes first (IS NULL counts), then the IN (which your OR turns into).

Even faster might be to make the index "covering":

INDEX(deleted, type, amount)   -- in this order

Turning OR into UNION is a good trick, but not necessary here.

If deleted is rarely NULL, then the Optimizer may prefer that index, even if it turns out to be less efficient. (This may explain the problem you pose. My composite index avoids the issue.)

Independent issue: Why have deleted? Can't you simply have deleted_by being NULL to indicate the same thing?

Rick James
  • 135,179
  • 13
  • 127
  • 222