Full text searches take minutes rather than seconds. The table has 50K rows. Each ocr_text contains a lot of text, all the text from 2 full pages of newspaper.
MariaDB [scu_db]> describe frames;
+-------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| dbRollID | int(10) unsigned | NO | MUL | 0 | |
| frame_num | int(10) unsigned | NO | | 0 | |
| xLeadEdge | int(11) | NO | | 0 | |
| yTrailEdge | int(11) | NO | | 0 | |
| wLeftEdge | int(11) | NO | | 0 | |
| hRightEdge | int(11) | NO | | 0 | |
| ocr_text | mediumtext | NO | MUL | NULL | |
| ocr_rects | longblob | NO | | NULL | |
+-------------+------------------+------+-----+---------+----------------+
Also the more often a word appears in the data, the slower the query.
This query takes 10 seconds:
SELECT dbRollID
FROM frames
WHERE MATCH (ocr_text) AGAINST ('+1912' IN BOOLEAN MODE)
ORDER BY id
This query for a common word takes 2.5 minutes:
SELECT dbRollID
FROM frames
WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY id
It makes no difference if I add LIMIT 50
(which I would think it should?).
This seem horrifically slow. What am I doing wrong here?
SHOW CREATE TABLE
result:
CREATE TABLE `frames` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`dbRollID` int(10) unsigned NOT NULL DEFAULT 0,
`frame_num` int(10) unsigned NOT NULL DEFAULT 0,
`xLeadEdge` int(11) NOT NULL DEFAULT 0,
`yTrailEdge` int(11) NOT NULL DEFAULT 0,
`wLeftEdge` int(11) NOT NULL DEFAULT 0,
`hRightEdge` int(11) NOT NULL DEFAULT 0,
`ocr_text` mediumtext COLLATE utf8mb4_unicode_ci NOT NULL,
`ocr_rects` longblob NOT NULL,
PRIMARY KEY (`id`),
KEY `fk_roll_id_1` (`dbRollID`),
FULLTEXT KEY `ocr_text` (`ocr_text`),
CONSTRAINT `fk_roll_id_1` FOREIGN KEY (`dbRollID`) REFERENCES `scansettings` (`dbRollID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=474139 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
explain SELECT dbRollID FROM frames WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE) ORDER BY id
result:
+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+
| 1 | SIMPLE | frames | fulltext | ocr_text | ocr_text | 0 | | 1 | Using where; Using filesort |
+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+
Edit / Update:
OK I have a hypothesis, but it may be based on incorrect assumptions.
So I have been told that by using FULLTEXT KEY on the text column, that MySQL creates an index for that column, and then uses that index smartly when you do a full text search (i.e. MATCH AGAINST in innodb engine). So in theory I am not supposed to have to overthink this and create my own indexing system. Is this true?
OK assuming that this is true, then MySQL creates this word index in some clever way, since there aren't really that many different words. Right?
But. My text is OCR text. And it is very bad OCR text. And it is very bad OCR text performed on bagillions of images, many of which don't even contain OCRable text, but cursive writing. So, I have bagillions of garbage words in my text. Maybe it's mostly garbage words. I don't know. But then does this mean that the index that MySQL builds will be huge? And huge enough that it can't be cached in innodb's memory buffers?
Thoughts? Could this be true? If so, if I could clean out all the garbage from my text maybe it will work?
SHOW TABLE STATUS LIKE "frames";
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment | Max_index_length | Temporary |
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+
| frames | InnoDB | 10 | Dynamic | 51419 | 80792 | 4154245120 | 0 | 3997696 | 7340032 | 474566 | 2021-08-17 13:26:24 | 2021-08-19 18:03:18 | NULL | utf8mb4_unicode_ci | NULL | | | 0 | N |
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+
EXPLAIN SELECT on RDS (different but similar server, MySQL rather than MariaDB)
explain SELECT dbRollID FROM frames WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE) ORDER BY id;
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+
| 1 | SIMPLE | frames | NULL | fulltext | ocr_text | ocr_text | 0 | const | 1 | 100.00 | Using where; Ft_hints: no_ranking; Using filesort |
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+
Update 8/27/2021 doing more experiments: I do not understand why limiting the scope of the query, either with additional WHERE conditions, or using a subquery, or using LIMIT, has absolutely no effect on the time it takes:
SELECT dbRollID FROM frames
WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY dbRollID;
29219 rows in set (1 min 46.959 sec)
SELECT xLeadEdge FROM frames
WHERE dbRollID=110
AND MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY xLeadEdge;
340 rows in set (1 min 45.984 sec)
SELECT * FROM frames
WHERE dbRollID=110;
512 rows in set (0.272 sec)
SELECT xLeadEdge
FROM
(
SELECT * FROM frames WHERE dbRollID=110
) AS a
WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY xLeadEdge;
340 rows in set (1 min 47.044 sec)
SELECT dbRollID FROM frames
WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY dbRollID LIMIT 1;
1 row in set (1 min 46.575 sec)
SELECT xLeadEdge FROM frames
WHERE dbRollID=110 AND MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
ORDER BY xLeadEdge LIMIT 1;
1 row in set (1 min 46.939 sec)
(Note: and the 1 min 45 rather than 3 min something is because I have the innodb buffer set MUCH larger than original testing. But a near 2 minute query is still uselessly slow.)