MySQL 5.7 miss matching against two-Chinese-character words

Question

Table tb:

title                       keywords
企业贷款，该如何看懂银行流水     出口退税,贷款,现金流量表,利润表
最高人民法院关于审理民间        企业贷款,合同法,贷款
中国宏观经济数据点评           企业贷款,贷款

ngram_token_size=2
FULLTEXT INDEX `keywords_title`(`keywords`, `title`)

Now I match keywords and title with a four-Chinese-character word '企业贷款'

select * from tb where MATCH (keywords, title) AGAINST ('企业贷款');

|[
title                       keywords
企业贷款，该如何看懂银行流水     出口退税,贷款,现金流量表,利润表
最高人民法院关于审理民间        企业贷款,合同法,贷款
中国宏观经济数据点评           企业贷款,贷款
]|

But When I match them with a two-chinese-character word '贷款', no result matched.

select * from tb where MATCH (keywords, title) AGAINST ('贷款');
|[
]|

Can you add only "贷款" word comma separated and test query again? "贷款" word is not present in MATCH — Mangesh Sathe, Apr 17 '18 at 14:47
TRY https://stackoverflow.com/questions/19777359/mysql-query-match-against-using-wildcard — Mangesh Sathe, Apr 17 '18 at 15:04
select * from tb where MATCH (keywords, title) AGAINST ('*贷款' IN BOOLEAN MODE); — Mangesh Sathe, Apr 17 '18 at 15:27
What's the point of a FULL TEXT index if wildcards are necessary? — Anthony, Apr 17 '18 at 16:04
It's because full text searches are "natural language" searches by default, meaning it matches based on "full words", not sub-strings of words. I tested with your data but replacing `企业贷款` with `test`. `MATCH (keywords, title) AGAINST ('test')` returns the same results, but `MATCH (keywords, title) AGAINST ('est')` returns no results. So this is not multibyte-specific. — Anthony, Apr 17 '18 at 16:22
@MangeshSathe - Full text searches don't support prepended wildcards. Your suggestion won't work. Nothing will work. This is working by design. Full Text searches don't do partial-string matches, as it is a "natural language" search. To achieve stated goal, `LIKE` should be used instead. — Anthony, Apr 17 '18 at 16:29

score 0 · Answer 1 · edited Apr 17 '18 at 16:04

0

ALTER TABLE tb
DROP INDEX `keywords_title ` ,
ADD FULLTEXT INDEX `keywords_title ` (`keywords` ASC, `title` ASC)  /*!70000 WITH PARSER `ngram` */

/*!70000 WITH PARSERngram*/ is necessary!

edited Apr 17 '18 at 16:04

Anthony

36,459
25
97
163

answered Apr 17 '18 at 15:57

AarioAi

563
1
5
18

CryMasK · Answer 2 · 2019-07-26T14:45:49.357

/*!70000 WITH PARSER ngram*/

It mean the syntax will only be executed when the MySQL version >= 7.0.0, so you didn't use the ngram parser at all.

In my advice, you still need to use the ngram parser, because the default full-text parser can't handle the Chinese string well (This also mentioned in the document).

I ran some tests on MySQL 5.7 & 8.0 server, got same results:

For MyISAM

X: return empty result  O: return rows contain the pattern

built-in parser
  IN NATURAL LANGUAGE MODE (default)
    企业贷款    X
    贷款        X
  IN BOOLEAN MODE
    企业贷款    O
    贷款        X

ngram
  IN NATURAL LANGUAGE MODE (default)
    企业贷款    X
    贷款        X
  IN BOOLEAN MODE
    企业贷款    O
    贷款        O

For InnoDB

built-in parser
  IN NATURAL LANGUAGE MODE (default)
    企业贷款    O
    贷款        X
  IN BOOLEAN MODE
    企业贷款    O
    贷款        X
ngram
  IN NATURAL LANGUAGE MODE (default)
    企业贷款    O
    贷款        O
  IN BOOLEAN MODE
    企业贷款    O
    贷款        O

First, if you search in natural language mode, there is a 50% threshold in MyISAM engine. Both 企业贷款 and 贷款 occur in more than half of the rows, MySQL will treat them as stopwords (doc), so it return nothing.

Next, if you use the built-in parser, it causes strange results, since the parser can't process these ideographic languages well.

There are two reasons for your situation:

Not like ngram, the built-in parser only can use some symbols (in this case, is comma) to parse a Chinese sentence. The parser only catches 企业贷款 token and doesn't catch it's sub-string 贷款.
The built-in parser treats a Chinese token in the same way as it treats an English token. The length of 贷款 is smaller than the minimum length of a word that considered to index (check ft_min_word_len setting for MyISAM, innodb_ft_min_token_size for InnoDB), so it will not be indexed even though 贷款 occurs in the keywords field many times.

Qiulang · Answer 3 · 2022-10-27T08:31:29.403

The MYSQL document explains the purpose of ngram Full-Text Parser

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK).

ngram_token_size=2 (the default value) let you search the 2 characters Chinese words in your case.

/*!70000 WITH PARSER ngram */ means if MySQL version >= 7.0.0 executes it. FTS was introduced in mysql 5.6, so /*!50600 WITH PARSER ngram */ should be enough.

MySQL 5.7 miss matching against two-Chinese-character words

3 Answers3

For MyISAM

For InnoDB

Linked