
My Query looks like this

EXPLAIN SELECT 
    r.owner_id, 
    r.owner_address, 
    r.owner_platform,
    r.updated_at 
FROM some_owner_table as r 
WHERE 
    r.updated_at > '2022-09-16 22:16:38.832' 
ORDER BY 
    r.updated_at DESC LIMIT 200;

The result is

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'SIMPLE', 'r', NULL, 'range', 'INDEX_by_updated_at', 'INDEX_by_updated_at', '6', NULL, '1', '100.00', 'Using index condition'

However, if we use a different date that (I think) matches more rows, we get:

'1', 'SIMPLE', 'r', NULL, 'ALL', 'INDEX_by_updated_at', NULL, NULL, NULL, '263', '37.64', 'Using where; Using filesort'

Using filesort seems problematic in terms of performance, and it's no longer showing Using index condition.

Is this how indexing really works or can we do something to further optimize our queries for this table?

EDIT: Table has 263 total rows.

EDIT: Create query:

CREATE TABLE `some_owner_table` (
  `owner_id` bigint(20) NOT NULL,
  `owner_address` bigint(20) NOT NULL,
  `owner_platform` int(11) NOT NULL,
  `updated_at` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
  PRIMARY KEY (`owner_id`,`owner_platform`),
  KEY `INDEX_by_updated_at` (`updated_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
NiceOneMoney
  • Yeah, that's why I deleted my comment. In this case, I think it's determined that the different date selects so much of the table that there's little point in using it. – Barmar Jun 29 '22 at 20:14
  • seems like the optimizer is making a decision based on the where criterion without taking the limit into account. there are always going to be corner cases where the optimizer just isn't complex enough. I would try doing just the from...order by...limit in a subselect, and have the where condition applied in an outer select. – ysth Jun 29 '22 at 20:18
  • how many rows are there in the table? and how many rows would it find without the limit in the original query and the modified date query? – ysth Jun 29 '22 at 20:21
  • @ysth The entire table is 263 rows, which is why the results confused me. – NiceOneMoney Jun 29 '22 at 20:22
  • 1
    with so few rows, I'm surprised it ever even bothers using an index. don't even worry about it. even with a hundred times as many rows, this isn't going to be slow. – ysth Jun 29 '22 at 20:25
  • @ysth. In prod we have 1M+ rows. – NiceOneMoney Jun 29 '22 at 20:26
  • 1
    then you need to test your queries against prod if you are concerned. don't waste time speculating how they will be optimized with non-real data – ysth Jun 29 '22 at 20:27
  • 1
    `filesort` sounds like a bigger performance nightmare than it is. It means the server accumulated a result set and then sorted it. It doesn't necessarily mean the sort operation used up the RAM in the server and spilled to an actual file-system file. In many cases the `ORDER BY ... LIMIT ...` query pattern needs a filesort. An index on the ordering column can sometimes allow such a query to be satisfied without a filesort. Can't say more without your table definition. – O. Jones Jun 29 '22 at 23:39

2 Answers


MySQL's optimizer generally chooses to skip using the index if it can infer that your condition would match a large enough portion of the table. In my experience, the threshold is about 20%, but this is not an officially documented feature, and it may change if the MySQL Server code changes.

The reason is that it actually takes more work per row to do an index lookup and then fetch the whole row from that index entry. The optimizer may assume that there's a point at which it's more economical to just walk the table row by row in primary key order, keeping the rows that match the condition.

But if the specific value you're searching for occurs on a small subset of the rows, then it's more economical to select those few index entries, then fetch the corresponding rows.

If you think the optimizer has made the wrong choice, you can use the FORCE INDEX hint, which makes the optimizer treat a table-scan as infinitely costly, so if the index is relevant at all, it'll use it.
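As a sketch, the hint goes right after the table reference in the FROM clause (this assumes the table and index names from the question):

```sql
-- Force the optimizer to use the updated_at index even when it
-- estimates a table scan would be cheaper
SELECT r.owner_id, r.owner_address, r.owner_platform, r.updated_at
FROM some_owner_table AS r FORCE INDEX (INDEX_by_updated_at)
WHERE r.updated_at > '2022-09-16 22:16:38.832'
ORDER BY r.updated_at DESC
LIMIT 200;
```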

It might be worthwhile, for example, to avoid the filesort. That is, if you force the query to scan rows by reading the index on updated_at, then sorting becomes a no-op, and it will avoid the filesort.

Another idea: If you're testing a very small dataset, the optimizer could reason that using an index doesn't matter, because the number of rows is going to fit in such a small number of pages in RAM anyway, and the cost of searching or sorting will be trivial.

This is why you should test optimization with a larger sample size, because the optimizer might make different choices for a trivially-sized dataset.
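If you can't test against production data, one way to get a realistic row count is to generate throwaway rows. A rough sketch, assuming MySQL 8+ (for recursive CTEs) and that owner_platform can be a constant for test purposes:

```sql
-- Generate ~1M synthetic rows so the optimizer sees realistic statistics.
-- updated_at fills in via its DEFAULT CURRENT_TIMESTAMP(3).
SET SESSION cte_max_recursion_depth = 1000000;

INSERT INTO some_owner_table (owner_id, owner_address, owner_platform)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT n, n, 1 FROM seq;

-- Refresh index statistics before re-running EXPLAIN
ANALYZE TABLE some_owner_table;
```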

Bill Karwin
  • It's using the INDEX for a smaller set (via WHERE) but is `Using filesort` for a slightly larger set... Shouldn't it be the opposite? Use the index for larger sets and just walk through for smaller? – NiceOneMoney Jun 29 '22 at 21:48
  • Again, it depends on the optimizer's estimate of how many rows it will match. Focus on the first part of my answer, not on the second part (after "another idea"). – Bill Karwin Jun 30 '22 at 12:28

The query planner in MySQL / MariaDB makes cost-based choices about whether to use indexes. Oversimplifying, it guesses how much CPU and IO time it will take to satisfy the query with or without using indexes, and chooses the smallest guess. Like @BillKarwin said, those choices can be very different for small (hundreds of rows) and large (millions) tables. So don't waste a whole ton of your time EXPLAINing queries on small tables.

You have the WHERE col > constant ORDER BY col DESC LIMIT 200 pattern in your query. When your query doesn't use an index, the server does, roughly, these things:

  • scans the table row-by-row looking for rows matching the WHERE clause.
  • places those rows into an internal data structure, a sort of lightweight table. That table, when mentioned in EXPLAIN output, is called a "file". It's not a file-system file.
  • when the table scan is done, it sorts that "file" as specified by ORDER BY.
  • finally, it sends you the first 200 rows from the sorted "file" and discards the rest.

But you have an index on the column in question. When it uses the index, the server can work more efficiently. It:

  • random-accesses the index for the first row matching the WHERE clause.
  • scans sequentially through the index until the last matching row
  • retrieves and sends you each matching row
  • and stops when it has sent 200 rows.

No sorting needed. (Hooray.)

If the query you mentioned has to be faster than it is in production, you can try recreating your index in descending order like this:

ALTER TABLE some_owner_table 
     DROP KEY INDEX_by_updated_at,
      ADD KEY INDEX_by_updated_at 
          (updated_at DESC);

This puts your index in descending order, matching your ORDER BY clause exactly. It is slightly less expensive to scan an index forward rather than backward, and this revised index lets that happen.

If it's still not fast enough, you can try creating a covering index. A covering index allows the server to satisfy the query completely from the index, rather than going from the index to the table* to look up every row's data. That looks like this.

ALTER TABLE some_owner_table 
     DROP KEY INDEX_by_updated_at,
      ADD KEY INDEX_by_updated_at 
           (updated_at DESC, owner_id, owner_address, owner_platform);

An index like this takes tablespace (SSD or disk space) and slightly slows down INSERTs and UPDATEs, because the server must update the index as well as the table. But tablespace is cheap. And most tables don't get nearly as many INSERTs or UPDATEs as they do SELECTs.
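You can check whether the covering index is actually being used by re-running EXPLAIN; if the index covers the query, the Extra column should say "Using index" (as opposed to "Using index condition"), meaning the server never touched the table itself:

```sql
-- With the covering index in place, Extra should read "Using index",
-- i.e. every requested column was satisfied from the index alone.
EXPLAIN SELECT r.owner_id, r.owner_address, r.owner_platform, r.updated_at
FROM some_owner_table AS r
WHERE r.updated_at > '2022-09-16 22:16:38.832'
ORDER BY r.updated_at DESC
LIMIT 200;
```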

* In InnoDB, the table is actually stored inside the primary key. It's called a clustered index.

O. Jones