Environment

  • AMD Ryzen 9 5950X, 128GB 3200MHz Dual Channel
  • Datacenter NVMe SSDs with 3GB/s+ read and write
  • MariaDB 10.6.3 x64
  • Windows Server 2019 (same issue on Debian though)
  • dedicated machine, no other tasks running

my.ini

[mysqld]
default-storage-engine=INNODB
log-output=NONE
general-log=0
general_log_file="mariadb.log"
slow-query-log=0
query_cache_type=OFF
query_cache_size=0
innodb_buffer_pool_size=64G
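
(For reference, the effective values can be double-checked at runtime; innodb_buffer_pool_size is reported in bytes.)

SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL VARIABLES LIKE 'query_cache%';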

DDL

CREATE TABLE testinnodb
(
    a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL,
    i FLOAT NOT NULL, j FLOAT NOT NULL, k FLOAT NOT NULL,
    x CHAR(20) NOT NULL, y CHAR(20) NOT NULL, z CHAR(20) NOT NULL
) ENGINE=InnoDB;

Same schema for MyISAM and Memory.

Each table is filled with 10M rows of random data; the resulting data sizes are:
InnoDB: 1.0 GB
MyISAM: 810 MB
Memory: 867 MB
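
The exact data generator doesn't matter much; a minimal sketch using MariaDB's built-in SEQUENCE engine (value ranges are arbitrary, not necessarily what produced the sizes above) would be:

-- Fill one table with 10M rows of random integers, floats and 20-char strings
INSERT INTO testinnodb (a, b, c, i, j, k, x, y, z)
SELECT
    FLOOR(RAND() * 1000000), FLOOR(RAND() * 1000000), FLOOR(RAND() * 1000000),
    RAND(), RAND(), RAND(),
    LEFT(MD5(RAND()), 20), LEFT(MD5(RAND()), 20), LEFT(MD5(RAND()), 20)
FROM seq_1_to_10000000;

(The same INSERT ... SELECT is repeated for testmyisam and testmemory; max_heap_table_size must be raised for the Memory table to hold that much data.)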

SQL

SELECT * FROM testinnodb WHERE c=1;
SELECT * FROM testmyisam WHERE c=1;
SELECT * FROM testmemory WHERE c=1;

InnoDB: 2.4s !!!
MyISAM: 0.3s
Memory: 0.2s

The queries are run multiple times but performance stays the same. EXPLAIN gives the same output for all three queries (SIMPLE, USING WHERE).
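
One way to dig deeper is MariaDB's ANALYZE statement, which executes the query and reports actual per-step row counts and timings, and should confirm that the time is spent in the table scan itself:

ANALYZE FORMAT=JSON SELECT * FROM testinnodb WHERE c=1;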

This is clearly not an I/O issue, given the hardware and the performance of MyISAM and Memory in comparison.

64GB for the buffer pool is also more than enough to hold all of that table in memory.
The data must be in the buffer pool: with innodb_buffer_pool_load_at_startup disabled, the query takes 4.2s on the first run and 2.4s on subsequent runs.
innodb_buffer_pool_bytes_data will have grown by over 1GB after the first run, so it looks like the entire data is in fact in the buffer pool.
innodb_buffer_pool_read_requests does increase by about 10M on each execution.
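
The counters above can be sampled before and after each run, e.g.:

-- Innodb_buffer_pool_reads counts requests that missed the buffer pool and went to disk;
-- if it stays flat between runs, the scan is served entirely from memory.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_bytes_data';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';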

Why the hell is reading the data with InnoDB from the buffer pool (i.e. RAM) roughly 8 times slower than reading the same data with MyISAM (i.e. from SSD)?

I need help understanding what is going on. Surely this can't be right? I've tried playing around with the DB config (e.g. innodb_old_blocks_time=0, innodb_read_io_threads=32, and innodb_write_io_threads=32), but this changed literally nothing.

I know using an INDEX will improve things, but this is not the point.
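
For completeness, the index in question would be something along these lines (the index name is arbitrary), but again, the question is about the scan itself:

ALTER TABLE testinnodb ADD INDEX idx_c (c);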

Let me know if you need any status variables for debugging; I'm new to InnoDB, so I'm not sure what is relevant to post here.

Output of SHOW ENGINE INNODB STATUS, taken after startup and after querying the InnoDB table twice:

=====================================
Per second averages calculated from the last 50 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 50 srv_idle
srv_master_thread log flush and writes: 50
----------
SEMAPHORES
----------
------------
TRANSACTIONS
------------
Trx id counter 20001586
Purge done for trx's n:o < 20001583 undo n:o < 0 state: running
History list length 14
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION (000002727BD64108), not started 
0 lock struct(s), heap size 1128, 0 row lock(s)
--------
FILE I/O
--------
Pending flushes (fsync) log: 0; buffer pool: 0
71726 OS file reads, 2 OS file writes, 2 OS fsyncs
0.00 reads/s, 16413 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 8881303100
Log flushed up to   8881303100
Pages flushed up to 8881303100
Last checkpoint at  8881303088
0 pending log flushes, 0 pending chkp writes
4 log i/o's done, 0.08 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 68753031168
Dictionary memory allocated 424846000
Buffer pool size   4147712
Free buffers       4075870
Database pages     71842
Old database pages 26539
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 90.000
Pending reads 0
Pending writes: LRU 0, flush list 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 71711, created 131, written 0
1434.19 reads/s, 2.62 creates/s, 0.00 writes/s
Buffer pool hit rate 996 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 71842, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 read views open inside InnoDB
Process ID=0, Main thread ID=0, state: sleeping
Number of rows inserted 0, updated 0, deleted 0, read 20000000
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 399992.00 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
Alex K.
  • This isn't an answer, but I would guess your MyISAM table is also in RAM, since MyISAM uses buffered I/O through the filesystem cache; I don't know of any way to disable that to get a true I/O test. So MyISAM is doing a sequential read of row data only, whereas InnoDB stores the table as a clustered index, so it has to traverse that B-tree even for a table-scan. MyISAM has always been superior for table-scans. – Bill Karwin Jul 14 '21 at 14:45
  • @BillKarwin Thank you for the insight! Even in a B-tree structure, 2.4 seconds to scan just 1GB of data sounds incredibly slow for one of the fastest (single-threaded) CPUs on the planet. But I guess I have to take it. Is there maybe a trick to somewhat improve performance in such scenarios? Tune B-tree settings, or have InnoDB keep a copy of the data in memory in a more favorable layout, or something? – Alex K. Jul 14 '21 at 15:13
  • Not to my knowledge. InnoDB records are stored as basically a complex web of linked lists, so doing a table-scan has to hop from record to record across many pages. Each record links to the next (and the previous) record. The pages are not necessarily contiguous on disk, and even less so when the pages are loaded into the buffer pool. InnoDB is clearly not optimized for table-scans; it's optimized for CRUD-like query patterns. – Bill Karwin Jul 14 '21 at 15:33
  • If you want to learn everything about how InnoDB records and pages are stored, this developer reverse-engineered it and posted a series of blogs describing every data structure: https://blog.jcole.us/innodb/ He also provides tools to dump the contents of the InnoDB structures. – Bill Karwin Jul 14 '21 at 15:36
  • Please don't cross-post. – Rick James Jul 14 '21 at 22:07

1 Answer

There is quite a bit of overhead in the interface between InnoDB and the SQL interpreter of the MySQL or MariaDB server.

In InnoDB, each access must be protected by a buffer pool page latch. A mini-transaction object will keep track of the acquired latches. Basically, for every fetched row, InnoDB will start a mini-transaction, look up the B-tree leaf page in the buffer pool, acquire the page latch, copy the data, and finally commit the mini-transaction and release the page latch.

There are a couple of optimizations on top of this, but this is insufficient, and it would be better to implement MDEV-16232 to allow a mini-transaction to persist across the entire range scan. In that way, we would only acquire and release page latches when advancing to the next page.

In range scans, a persistent cursor (btr_pcur_t) will store the current position. When the cursor position is restored at the start of the next mini-transaction (to fetch the next record), an optimistic restore will be attempted, with the assumption that the old pointer to the buffer pool page is still valid.

InnoDB also implements a prefetch buffer. After 4 next-record read operations, InnoDB will copy 8 records at a time to the buffer, within a single mini-transaction. Subsequent requests will then be satisfied from this buffer. This mechanism would be made redundant by MDEV-16232 and should be removed as part of implementing it.

Implementing MDEV-16232 would also speed up UPDATE and DELETE operations, by removing the need to acquire explicit record locks. If we continuously hold the page latch for the whole duration of deleting or updating a row, we can rely on implicit locking whenever no conflicts exist, just like we do in the INSERT case.

Marko Mäkelä
  • A very insightful post, thanks! Too bad that JIRA task has been open for over 3 years now even though it's confirmed and has a "major" priority. Hopefully a future MySQL/MariaDB update will have this new implementation and provide improved performance for such scenarios. In the meantime, is there maybe any workaround or "hack" one could use to improve performance in such instances? – Alex K. Jul 16 '21 at 15:16
  • Unfortunately, I am not aware of any work-around. Other improvements were deemed more important in the meantime. We can only achieve so much with a small team of InnoDB developers at MariaDB. If this does not happen in the 10.7 release, it deserves to be on the 10.8 roadmap, because it should be a notable performance improvement. – Marko Mäkelä Jul 16 '21 at 16:46