
Full-text searches take minutes rather than seconds. The table has 50K rows, and each ocr_text value holds a lot of text: all the OCR output from two full pages of newspaper (roughly 2,500 words / 12,500 characters).

MariaDB [scu_db]> describe frames;
+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| id          | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| dbRollID    | int(10) unsigned | NO   | MUL | 0       |                |
| frame_num   | int(10) unsigned | NO   |     | 0       |                |
| xLeadEdge   | int(11)          | NO   |     | 0       |                |
| yTrailEdge  | int(11)          | NO   |     | 0       |                |
| wLeftEdge   | int(11)          | NO   |     | 0       |                |
| hRightEdge  | int(11)          | NO   |     | 0       |                |
| ocr_text    | mediumtext       | NO   | MUL | NULL    |                |
| ocr_rects   | longblob         | NO   |     | NULL    |                |
+-------------+------------------+------+-----+---------+----------------+

Also, the more often a word appears in the data, the slower the query runs.

This query takes 10 seconds:

SELECT dbRollID 
   FROM frames 
   WHERE MATCH (ocr_text) AGAINST ('+1912' IN BOOLEAN MODE) 
ORDER BY id

This query for a common word takes 2.5 minutes:

SELECT dbRollID 
   FROM frames 
   WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE) 
ORDER BY id

Adding LIMIT 50 makes no difference (which I would have thought it should?).

This seems horrifically slow. What am I doing wrong here?

SHOW CREATE TABLE result:

CREATE TABLE `frames` (  
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,  
  `dbRollID` int(10) unsigned NOT NULL DEFAULT 0,  
  `frame_num` int(10) unsigned NOT NULL DEFAULT 0,  
  `xLeadEdge` int(11) NOT NULL DEFAULT 0,  
  `yTrailEdge` int(11) NOT NULL DEFAULT 0,  
  `wLeftEdge` int(11) NOT NULL DEFAULT 0,  
  `hRightEdge` int(11) NOT NULL DEFAULT 0,  
  `ocr_text` mediumtext COLLATE utf8mb4_unicode_ci NOT NULL,  
  `ocr_rects` longblob NOT NULL,  
  PRIMARY KEY (`id`),  
  KEY `fk_roll_id_1` (`dbRollID`),  
  FULLTEXT KEY `ocr_text` (`ocr_text`),  
  CONSTRAINT `fk_roll_id_1` FOREIGN KEY (`dbRollID`) REFERENCES   `scansettings` (`dbRollID`) ON DELETE CASCADE ON UPDATE CASCADE  
) ENGINE=InnoDB AUTO_INCREMENT=474139 DEFAULT CHARSET=utf8mb4   COLLATE=utf8mb4_unicode_ci;  

explain SELECT dbRollID FROM frames WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE) ORDER BY id result:

+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+
| id   | select_type | table  | type     | possible_keys | key      | key_len | ref  | rows | Extra                       |
+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+
|    1 | SIMPLE      | frames | fulltext | ocr_text      | ocr_text | 0       |      | 1    | Using where; Using filesort |
+------+-------------+--------+----------+---------------+----------+---------+------+------+-----------------------------+

Edit / Update:

OK I have a hypothesis, but it may be based on incorrect assumptions.

So I have been told that by putting a FULLTEXT KEY on the text column, MySQL creates an index for that column and then uses that index intelligently when you do a full-text search (i.e. MATCH ... AGAINST with the InnoDB engine). So in theory I am not supposed to have to overthink this or create my own indexing system. Is this true?

OK, assuming that this is true, then MySQL builds this word index in some clever way, since ordinary text doesn't really contain that many distinct words. Right?

But. My text is OCR text. And it is very bad OCR text, performed on bagillions of images, many of which don't even contain OCRable text, but cursive writing. So I have bagillions of garbage "words" in my text. Maybe it's mostly garbage words; I don't know. Does this mean that the index MySQL builds will be huge? Huge enough that it can't be cached in InnoDB's memory buffers?

Thoughts? Could this be true? And if so, would cleaning all the garbage out of my text fix it?
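If this hypothesis is right, the knobs that control what gets indexed seem like the place to start. Here is a sketch of the options as I understand them, not verified on my data (innodb_ft_min_token_size is read-only at runtime, so it has to go in the config file, and the FULLTEXT index must be rebuilt before it takes effect):

```sql
-- See what the server currently indexes
SHOW VARIABLES LIKE 'innodb_ft%';

-- After setting innodb_ft_min_token_size = 4 in my.cnf/my.ini and restarting,
-- rebuild the FULLTEXT index so 1-3 character OCR fragments are excluded:
ALTER TABLE frames DROP INDEX ocr_text;
ALTER TABLE frames ADD FULLTEXT INDEX ocr_text (ocr_text);
```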

SHOW TABLE STATUS LIKE "frames";
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+
| Name   | Engine | Version | Row_format | Rows  | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time         | Check_time | Collation          | Checksum | Create_options | Comment | Max_index_length | Temporary |
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+
| frames | InnoDB |      10 | Dynamic    | 51419 |          80792 |  4154245120 |               0 |      3997696 |   7340032 |         474566 | 2021-08-17 13:26:24 | 2021-08-19 18:03:18 | NULL       | utf8mb4_unicode_ci |     NULL |                |         |                0 | N         |
+--------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------+------------------+-----------+

EXPLAIN SELECT on RDS (a different but similar server, running MySQL rather than MariaDB):

explain SELECT dbRollID FROM frames WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE) ORDER BY id;
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+
| id | select_type | table  | partitions | type     | possible_keys | key      | key_len | ref   | rows | filtered | Extra                                             |
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+
|  1 | SIMPLE      | frames | NULL       | fulltext | ocr_text      | ocr_text | 0       | const |    1 |   100.00 | Using where; Ft_hints: no_ranking; Using filesort |
+----+-------------+--------+------------+----------+---------------+----------+---------+-------+------+----------+---------------------------------------------------+

Update 8/27/2021, after more experiments: I do not understand why limiting the scope of the query, whether with additional WHERE conditions, a subquery, or LIMIT, has absolutely no effect on how long it takes:

SELECT dbRollID FROM frames
    WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
    ORDER BY dbRollID;
29219 rows in set (1 min 46.959 sec)

SELECT xLeadEdge FROM frames
    WHERE dbRollID=110
      AND MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
    ORDER BY xLeadEdge;
340 rows in set (1 min 45.984 sec)

SELECT * FROM frames
    WHERE dbRollID=110;
512 rows in set (0.272 sec)

SELECT xLeadEdge
    FROM 
    (
        SELECT * FROM frames WHERE dbRollID=110
    ) AS a
    WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
    ORDER BY xLeadEdge;
340 rows in set (1 min 47.044 sec)

SELECT dbRollID FROM frames
    WHERE MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
    ORDER BY dbRollID LIMIT 1;
1 row in set (1 min 46.575 sec)

SELECT xLeadEdge FROM frames
    WHERE dbRollID=110 AND MATCH (ocr_text) AGAINST ('+john' IN BOOLEAN MODE)
    ORDER BY xLeadEdge LIMIT 1;
1 row in set (1 min 46.939 sec)

(Note: the times are 1 min 45 sec rather than 3+ minutes because I have since set the InnoDB buffer pool MUCH larger than in the original testing. But a nearly two-minute query is still uselessly slow.)

  • instead of the describe, can you show output of `show create table frames;` please? – ysth Aug 20 '21 at 04:50
  • roughly, what's the average length of ocr_text? – ysth Aug 20 '21 at 04:51
  • 2500 words / 12500 characters, I will reply again with create table... – Paul Aug 20 '21 at 05:08
  • 1
    @Paul I have changed the tag from MySQL to MariaDB. Although the two products have a common origin, they are now sufficiently different that they should be regarded as different products. Feel free to change it back if appropriate – Tangentially Perpendicular Aug 20 '21 at 05:13
  • ysth too big for reply so I added the CREATE TABLE to the post – Paul Aug 20 '21 at 05:15
  • @TangentiallyPerpendicular OK but I run this on MariaDB on XAMPP and on MySQL 5.7 on AWS RDS to the same effect. – Paul Aug 20 '21 at 05:17
  • I removed the second question, as it was muddling the first question and is unrelated. – Paul Aug 20 '21 at 18:25
  • Please provide `EXPLAIN SELECT ...` for each query. How many rows have "John"? – Rick James Aug 20 '21 at 22:08
  • @RickJames added to post, and 29219 rows – Paul Aug 21 '21 at 01:18
  • How much RAM do you have? What is the value of `innodb_buffer_pool_size`? – Rick James Aug 21 '21 at 07:11
  • @RickJames on my AWS RDS server I am not sure (of the innodb_buffer_pool_size). They abstract the my.ini from you. I will investigate tomorrow. On my local XAMPP test server it was set very low: 16M. I just tried setting it to 1024M and it made no difference. As for RAM, on RDS only 2GB, on my local test server 16GB but XAMPP is 32 bit so likely only 4GB. – Paul Aug 23 '21 at 04:57
  • You can check the size of the buffer pool: `SELECT @@innodb_buffer_pool_size;` – Bill Karwin Aug 23 '21 at 16:03
  • SELECT @@innodb_buffer_pool_size on RDS instance: 1073741824 – Paul Aug 23 '21 at 22:11
  • What happens to performance if you omit `ORDER BY id` ? – O. Jones Aug 24 '21 at 10:52
  • @O.Jones No difference at all without the ORDER BY. – Paul Aug 24 '21 at 18:11
  • @RickJames So I tried fiddling with the innodb_buffer_pool_size some more, increasing it to 2GB. I said it had no effect. This is not correct. It has a minor effect. Search times go from 3 to 4 minutes or 3.5 to just under 3 minutes. So it improves performance, but not in any meaningful way. Any search longer than 10 seconds really renders the feature useless. Also on my local Windows system I can see that the query is totally disk bound, with the disk usage pegged at 100% in TaskMan during the searches. It is like I have no index. – Paul Aug 24 '21 at 18:14
  • 1
    Please provide the output from `SHOW TABLE STATUS LIKE "frames"` – Rick James Aug 24 '21 at 21:46
  • What is the cardinality of the table "frames". Can you provide us the index of this table? – Noob Aug 24 '21 at 21:50
  • @RickJames added to post. – Paul Aug 25 '21 at 16:12
  • @RickJames any idea why limiting the scope of the query, with additional WHERE, or with a subquery, or with a LIMIT, has no effect on the time? – Paul Aug 27 '21 at 16:42
  • @Paul - There are a lot of possible answers. A simple example is when you have both `ORDER BY` and `LIMIT`. All possible rows need to be fetched and sorted before dealing out the few rows specified by the `LIMIT`. Even that case _may_ be fast _if_ the `ORDER BY` can be handled by an `INDEX`. I would be happy to explain individual queries; otherwise the question is much too broad. Meanwhile, `FULLTEXT` and `SPATIAL` add extra complications. – Rick James Aug 27 '21 at 17:25
  • @Paul - And this walks through a few simpler cases: http://mysql.rjweb.org/doc.php/index_cookbook_mysql It focuses somewhat on the complexity of `WHERE`. – Rick James Aug 27 '21 at 17:26
  • @Paul - And I added to my Answer to address your subquery/order-by/limit questions. – Rick James Aug 27 '21 at 17:48
  • @RickJames Thanks for all the answers, and time you've spent. – Paul Aug 27 '21 at 21:23
  • I still don't have a solution. I wonder how search engines can search terabytes of text in fractions of a second. Is it just about raw CPU/RAM/disk horsepower? Or perhaps I need to look for alternate solutions to MySQL. – Paul Aug 27 '21 at 21:26
  • It can find a rare name a lot faster than "John". Your example with "1912" somewhat demonstrates that. How big is the table -- in GB? (Do `SHOW TABLE STATUS LIKE 'frames';` if you can.) I may have an explanation after you provide that. – Rick James Aug 27 '21 at 23:12
  • @RickJames I already have included SHOW TABLE STATUS LIKE "frames"; in the question above. – Paul Aug 27 '21 at 23:37
  • I may be on to something. Seems like the innodb_buffer_pool_size may be at play here, but in combination with caching and other things that have been throwing me off the trail. Investigating... – Paul Aug 28 '21 at 23:54

2 Answers

  • Special index? - Yes.
  • Did it use it? - yes; see type=fulltext in the EXPLAIN.
  • Bagillion bogus 'words'? - Sure.
  • Too big to be cached? - Not a problem; the caching is in "blocks".
  • So looking for "John" needs to read only the block(s) with "John", possibly from disk.
  • Now comes the potential problem. For each row with "John", dbRollID needs to be found. This requires a lookup in a couple of BTrees, but since you have only 50K rows, I suspect both of them are well cached.
  • But the real problem is with "John". He occurs on 29219 rows -- about 60% of the rows. In MyISAM, that would trigger "skip the FT index"; I do not know if InnoDB has the same "50% rule".

To test my last theory, please search for some word that occurs in, say, 40% of the rows.
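To measure how common a candidate word is (the word 'street' below is only a placeholder), a sketch like this works, though it pays the same MATCH cost as your slow queries:

```sql
SELECT COUNT(*) AS hits,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM frames), 1) AS pct_of_rows
  FROM frames
 WHERE MATCH(ocr_text) AGAINST('+street' IN BOOLEAN MODE);
```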

To attempt to get a little more insight, please run

        SELECT  last_update,
                n_rows,
                'Data & PK' AS 'Type',
                clustered_index_size * 16384 AS Bytes,
                ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
                clustered_index_size AS Pages,
                ROUND(n_rows / clustered_index_size) AS 'Rows/page'
        FROM mysql.innodb_table_stats
        WHERE table_name = 'frames'
    UNION
        SELECT  last_update,
                n_rows,
                'Secondary Indexes' AS 'BTrees',
                sum_of_other_index_sizes * 16384 AS Bytes,
                ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
                sum_of_other_index_sizes AS Pages,
                ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
        FROM mysql.innodb_table_stats
        WHERE table_name = 'frames'
          AND sum_of_other_index_sizes > 0
          ;
+---------------------+--------+-------------------+------------+-----------+--------+-----------+
| last_update         | n_rows | Type              | Bytes      | Bytes/row | Pages  | Rows/page |
+---------------------+--------+-------------------+------------+-----------+--------+-----------+
| 2021-08-26 13:03:21 |  35305 | Data & PK         | 3476848640 |     98480 | 212210 |         0 |
| 2021-08-26 13:03:21 |  35305 | Secondary Indexes |    4194304 |       119 |    256 |       138 |
+---------------------+--------+-------------------+------------+-----------+--------+-----------+
2 rows in set (0.161 sec)

and

SELECT * FROM mysql.innodb_index_stats
    WHERE table_name = 'frames';   
+---------------+------------+------------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name       | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+------------------+---------------------+--------------+------------+-------------+-----------------------------------+
| scu_db        | frames     | FTS_DOC_ID_INDEX | 2021-08-26 13:03:21 | n_diff_pfx01 |      61745 |          20 | FTS_DOC_ID                        |
| scu_db        | frames     | FTS_DOC_ID_INDEX | 2021-08-26 13:03:21 | n_leaf_pages |         95 |        NULL | Number of leaf pages in the index |
| scu_db        | frames     | FTS_DOC_ID_INDEX | 2021-08-26 13:03:21 | size         |        160 |        NULL | Number of pages in the index      |
| scu_db        | frames     | PRIMARY          | 2021-08-26 13:03:21 | n_diff_pfx01 |      35305 |          20 | id                                |
| scu_db        | frames     | PRIMARY          | 2021-08-26 13:03:21 | n_leaf_pages |     189726 |        NULL | Number of leaf pages in the index |
| scu_db        | frames     | PRIMARY          | 2021-08-26 13:03:21 | size         |     212210 |        NULL | Number of pages in the index      |
| scu_db        | frames     | fk_roll_id_1     | 2021-08-26 13:03:21 | n_diff_pfx01 |        112 |          20 | dbRollID                          |
| scu_db        | frames     | fk_roll_id_1     | 2021-08-26 13:03:21 | n_diff_pfx02 |      62851 |          20 | dbRollID,id                       |
| scu_db        | frames     | fk_roll_id_1     | 2021-08-26 13:03:21 | n_leaf_pages |         55 |        NULL | Number of leaf pages in the index |
| scu_db        | frames     | fk_roll_id_1     | 2021-08-26 13:03:21 | size         |         96 |        NULL | Number of pages in the index      |
+---------------+------------+------------------+---------------------+--------------+------------+-------------+-----------------------------------+
10 rows in set (0.055 sec)

More on "why" some queries don't optimize well.

WHERE x=1  -- INDEX(x) helps a lot
WHERE MATCH(s) AGAINST(..)   -- Needs FULLTEXT
WHERE MATCH(s) AGAINST(..) AND x=1  -- (below)

When mixing FULLTEXT or SPATIAL and 'regular' indexing, the internals get complex. Simply put, there is no way to do both; it is only practical to do one, then brute-force (no index) on the other.

Hence, that 3rd expression could be done in one of two ways:

  • x=1 first, then MATCH. The problem with this is that FULLTEXT (I think) only works with the entire table. That is, this ordering cannot be done.
  • MATCH first, then x=1. Hence, this is the chosen method. In your example, the MATCH takes a long time. After that, filtering the results against x is relatively fast, in spite of having to check all the rows that the MATCH found.

In a "typical" FULLTEXT application, the FT filtering gets down to very few rows; hence "match first" is "good enough".

You have a subquery example. I can't explain it.

ORDER BY and/or LIMIT don't speed things up because it still has to do the entire MATCH first.
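To illustrate the difference (a sketch, not taken from your data): with an ordinary index the server can stop as soon as the LIMIT is satisfied, but with MATCH the entire full-text pass runs before ORDER BY and LIMIT apply.

```sql
-- Indexed equality: the server reads one entry from fk_roll_id_1 and stops
SELECT dbRollID FROM frames WHERE dbRollID = 110 LIMIT 1;

-- FULLTEXT: all rows containing 'john' are collected first,
-- then sorted, and only then cut down to one row
SELECT dbRollID FROM frames
 WHERE MATCH(ocr_text) AGAINST('+john' IN BOOLEAN MODE)
 ORDER BY id LIMIT 1;
```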

Rick James
  • Sorry still waiting on table rebuild to test your answer code on identical data set. Meanwhile I can test on a different (RDS) server, with different numbers unfortunately, but searching more common words does proportionally result in better times: john 41681 rows in set (1 min 42.45 sec), paul 23383 rows in set (42.92 sec), george 36884 rows in set (1 min 5.48 sec), ringo 43 rows in set (0.17 sec) - So is there a way to "skip the FT index" on/before the queries with too many results. I suppose I could just set a 10 second timeout as a hack and report "too many hits" when it times out... – Paul Aug 26 '21 at 18:20
  • @Paul - What version are you running? MariaDB 10.1 and later: https://mariadb.com/kb/en/aborting-statements/ -- That discusses the diffs between 10.1 and 5.7. – Rick James Aug 26 '21 at 19:06
  • on RDS I am running MySQL 5.7.34 and on XAMPP MariaDB 10.1.something. With regard to timeouts, I already know the syntax for each, and can make that work as a temporary hack. But I would still like to find a better solution. – Paul Aug 26 '21 at 19:20
  • @Paul - Please provide `EXPLAIN SELECT ...` for the new-but-not-fast-enough query. I may have missed something subtle. – Rick James Aug 26 '21 at 19:29
  • ok added to question - but again this is a different system - rebuilding my local tables is taking many hours... – Paul Aug 26 '21 at 20:17
  • @Paul - There may well be differences in the Optimizer (between MySQL and MariaDB) for your query. Or even between versions within a product. – Rick James Aug 26 '21 at 23:09
  • Yes but the slow performance I see happens on both platforms. – Paul Aug 27 '21 at 01:43
  • I have run your suggested queries now on the original data set on my local MariaDB, but StackOverflow won't let me edit the answer anymore to add them... But they are basically the same. Only the sizes are different. – Paul Aug 27 '21 at 15:13
  • @Paul - Take the stackoverflow tutorial -- It is quick and gives you a quick 100 rep points, which lets you do a few more things. – Rick James Aug 27 '21 at 17:34
  • Sorry for the stupid question, but where do I find the stack overflow tutorial? – Paul Aug 30 '21 at 18:00
  • Hmmm... I remember taking it years ago. But I can't find it now. – Rick James Aug 30 '21 at 18:33
  • Found it. Only worth 1 point. https://stackoverflow.com/tour – Paul Aug 30 '21 at 22:38
  • Thanks. I guess the 100 got deflated. – Rick James Aug 30 '21 at 22:55
  • Just FYI for completeness the sub-queries do help a lot, once the cache issue is solved. – Paul Aug 31 '21 at 23:36

Having more RAM and increasing innodb_buffer_pool_size can have a dramatic effect, depending on the size of the table, with some caveats.

On my local test server, with half of my rows in the table and the buffer pool set to 2GB, the following happens:

The first search for +john takes 2 minutes. Successive searches for +john take less than a second, and searches for any other common word then take only 10 to 15 seconds. If I reboot the server, the first search always takes 2 minutes again, and successive searches are fast. It seems the indexes are getting cached, so after a server reboot that first search will always be slow. I wonder what other conditions will cause this cache to be flushed and make the searches slow again. Do I just tell my users, "Oh, the first search you do every day will be horribly slow. Just wait for that and then it will work!"? Edit update: I found another SO q/a on this: mysql slow on first query, then fast for related queries. It's referred to as warming up the cache!
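The warm-up can be as simple as firing one throwaway MATCH query whenever mysqld restarts, so the first real user doesn't pay the two-minute cost (a sketch; the search word is arbitrary, and any common word that touches the FULLTEXT index should do):

```sql
-- Run once at server startup to pull the FULLTEXT index into the buffer pool
SELECT COUNT(*) FROM frames
 WHERE MATCH(ocr_text) AGAINST('+john' IN BOOLEAN MODE);
```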

Also, I had added a timeout to my code so the web page that uses this won't lock up. But I can't keep it, because if the query times out before the long two-minute first search completes, the cache never gets filled (if that is in fact what is happening), and then every search stays slow and always times out.

Also, if my table gets too big relative to my RAM and innodb_buffer_pool_size, it is like a threshold gets crossed, and all common-word searches start taking minutes, as if the cache were too small.

So maybe the answer, or one answer, is I just need a more powerful server, with more RAM. But I still think this shouldn't have to be the answer. Especially since my production data sets will be much bigger.

One more slightly related issue I noticed: if my SELECT requests more columns, especially larger ones, it breaks all this and REALLY slows down the queries even more. So even if I just have to live with this, I am going to have to write smarter queries, issuing multiple queries at a time, to get the data out.

Update: So here is a full list of all the things I did to mitigate this problem:

  • increase the system RAM and the innodb_buffer_pool_size to fit the full index into the cache
  • "warm up" the cache - write code to do a full text search for +john anytime the server restarts
  • remove all words 3 letters or less from the data (maybe this also helps with the cache)
  • modify any queries possible to do sub-queries
  • modify queries to not select any additional columns when doing MATCH AGAINST
  • modify queries to do follow-up queries to fetch the additional columns
  • add a MySQL query timeout for worst-case scenarios
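For the "remove all words 3 letters or less" step, the cleanup can be done in SQL along these lines (a sketch, assuming REGEXP_REPLACE is available, i.e. MariaDB 10.0.5+ or MySQL 8.0+; on MySQL 5.7 the same cleanup has to happen in application code, and the FULLTEXT index must be rebuilt afterwards):

```sql
-- Strip every word of 3 characters or fewer from the OCR text
UPDATE frames
   SET ocr_text = REGEXP_REPLACE(ocr_text, '\\b[[:alnum:]]{1,3}\\b', '');
```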
Paul