4

I have a table with column content that has a FULLTEXT index.

I want to take advantage of the speed of MATCH() on large text.

I want the search to be as accurate as possible.

When I search for the phrase string "large truck" this way:

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck"' IN BOOLEAN MODE);

Some instances are missed.

MyTable:

|   content    |
----------------
|Large \n truck| FOUND ✓
----------------
|large truck   | FOUND ✓
----------------
|large trucks  | *PLURAL MISSED!
----------------
|large truckl  | *TYPE-O MISSED!

If I use the standard LIKE / wildcard method:

SELECT * FROM  `MyTable` WHERE  `content` LIKE  '%large truck%'

MyTable:

|   content    |
----------------
|Large \n truck| *MISSED!
----------------
|large truck   | FOUND ✓
----------------
|large trucks  | FOUND ✓
----------------
|large truckl  | FOUND ✓

It seems I can't use a PHRASE search with wildcard together either:

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck*"' IN BOOLEAN MODE); **DOES NOT WORK**

OR

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck"*' IN BOOLEAN MODE); **DOES NOT WORK**

So...

How do I successfully use MATCH() AGAINST() searching for a phrase, and get all instances returned - even case insensitive partial string matches?

2 Answers2

5

The trick that I often use with FT is to do it in two steps:

  1. Do a MATCH, hoping to get all the desired text, but possibly some extra results.
  2. AND with another condition -- LIKE (faster) or REGEXP (more powerful).

The MATCH will be fast because of FT; the other part will be performed second, so it will be fast because there aren't many rows to check.

This matches your criteria:

SELECT * FROM MyTable
    WHERE MATCH(content) AGAINST('+large +truck*' IN BOOLEAN MODE)
      AND content REGEXP "large[[:space:]]+truck";

Phrased differently, the query will run something like this:

  1. Let's say the table has 10K rows.
  2. The FT expression will be evaluated. The MATCH will be very fast (because of the way it is designed). It will find all rows with both "large" and "truck*" anywhere in content. Now, let's say, there are 30 rows that satisfy that.
  3. The rest of the WHERE is evaluated. But it is done against only those 30 rows. So, even though the REGEXP is costly, it is not being done often.
  4. Then perhaps 14 rows are returned.

The net effect is that the whole query runs "fast", which was one of your requirements.

Note: I needed the second part to prevent these

large green truck
the truck is large

Depending on the version, you may need this REGEXP instead: "large\\s+truck".

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks. I'm trying to understand the logic that "the other part will be performed second, so it will be fast because there aren't many rows to check." Why will "the other pert be faster"? I am under the impression that MATCH() would be the fastest way on FULLTEXT index. And I step away from using LIKE for that reason. Now using a combination of both LIKE and MATCH - seems like it doubles my time then? And is there something conditional about performing a LIKE after a MATCH, that makes the LIKE faster? Thanks. Just trying to understand. –  Jul 17 '20 at 20:30
  • @Stnfordly - I added a rewording. Yes, adding the REGEXP might take twice as long as just the FT part. But it is only twice as long. Doing only the regexp might be 100 times a slow. And doing only the FT would give you too few rows (see your question) or too many rows (as per my version). (I picked REGEXP because I did not find a useful LIKE for this particular case.) – Rick James Jul 17 '20 at 21:38
  • Hi. Thanks. Are you saying that your query example will first do the MATCH() and then perform the REGEX only on that which it found in the MATCH()? –  Jul 17 '20 at 23:56
  • I'm going to try this - but will the regex not fail to find the "|Large \n truck|" instance? –  Jul 18 '20 at 21:02
  • 1
    @Stnfordly - If `\n` is truly a Carriage Return, then it is treated as "space" (aka whitespace); if it is two characters (backslash and n), then things get messier. – Rick James Jul 18 '20 at 22:31
  • Possibly \r as well –  Jul 19 '20 at 23:06
  • Oops -- `\r` is Carriage Return; `\n` is newline (Line Feed); both are "white space" when it comes to REGEXP. (LIKE has no test for "whitespace") – Rick James Jul 20 '20 at 00:18
1

Here is a quick hack with REGEXP, but it's not solving the issue as it does not use the fulltext index:

SELECT * 
FROM MyTable 
WHERE content REGEXP("large[[:space:]]+truck*");

Of course you can also use the FT index by searching without exact phrase:

SELECT * 
FROM MyTable 
WHERE MATCH(content) AGAINST('+large +truck*' IN BOOLEAN MODE);

But this will eventually include records that you do not want as it is not the same as exact phrase search.

Unfortunately phrase search (double quotes - "") and truncation operator (wildcard - *) will not work regardless if you use InnoDB or MyISAM. It does not work with the distance operator with InnoDB too (which is probably coming from the same origin). I guess it is related in the nature how data of the full text index is stored.

Philip Petrov
  • 975
  • 4
  • 8