3

I have a few column fulltext indexed and i'm testing some string to search. My db contains cars components so my researches could be for example "Engine 1.6". The problem is that when I use string with point (like 1.6) query returns no results.

Here's my variables

+--------------------------+----------------+
| ft_boolean_syntax        | + -><()~*:""&| |
+--------------------------+----------------+
| ft_max_word_len          | 84             |
+--------------------------+----------------+
| ft_min_word_len          | 4              |
+--------------------------+----------------+
| ft_query_expansion_limit | 20             |
+--------------------------+----------------+
| ft_stopword_file         | (built-in)     |
+--------------------------+----------------+

I don't know why but even if the ft_min_word_len is 4, a search like "Engine 24V" works. The query for matching is like this:

WHERE MATCH(sdescr,udescr) AGAINST ('+engine +1.6' IN BOOLEAN MODE)
andyts93
  • 337
  • 9
  • 24
  • 1
    The fulltext index contains words. By default, a period ends a word, and the period is not part of the word. To change that, you can/have to treat a period as a character, see e.g. [here](http://stackoverflow.com/questions/21696235/mysql-full-text-period). But this will treat all periods as characters. So the fulltext index of this comment will e.g. contain "words.". This can, depending on how you implement it, make things more complicated in other tables. Also, since you probably use innodb, the option is *innodb_ft_min_token_size* with default value 3, that's why you can find "24V". – Solarflare Apr 10 '17 at 10:05
  • Thanks for your reply. I've tried to modify the latin1 xml file changing from 10 to 01 the ctype of the character 2E (fullpoint), I've restarted mysql but query with 1.6 doesn't work yet. Have you got any idea? – andyts93 Apr 10 '17 at 10:46
  • Do you use latin1 for your column or e.g. (probably the server default) utf8? Also, you need to reindex your index after you changed settings (you can e.g. drop and recreate it). – Solarflare Apr 10 '17 at 13:18
  • Yes, my table collation is latin1_swedish_ci. I've tried to create new table with a new index. On the old table the query described above returns 0 results, in the new one it returns some rows but it ignores the '+1.6', so I get all the results with 'engine' in the index – andyts93 Apr 10 '17 at 14:55

2 Answers2

1

I spend the last day figuring out this issue. The reason why this is happening is that by default, MySQL/MariaDB collations treat space(" "), periods("."), and commas(",") as punctuation. Long story short, collations "weight" characters to determine how to filter or sort them. The punctuations mentioned above are considered EOL or 'stopwords.'

We need to have MySQL/MariaDB treat those punctuations as characters rather than punctuations to solve this issue.

We are presented with three solutions in the MySQL documentation. The first one requires changing the source code and recompiling, which isn't a very viable option for me. The second and third options are good and aren't too hard to follow.

  • Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.13.1, “Character Definition Arrays”.

First things first:

We need to know which character we're trying to fix. Take a look link below and find the HEX equivalent to the character you're trying to fix. In my case, it was 2E, the period. https://www.eso.org/~ndelmott/ascii.html

Now, we need to find the collation files in the database server.

  1. SSH into your server.
  2. Login into your MySQL/MariaDB: mysql -u root -p
  3. Run Show VARIABLES LIKE 'character_sets_dir'

The result should return a table with a value of a directory path. I was using docker, so mine came back as usr/share/mysql/charsets.

At this point, I opened a second terminal, but this is necessary.

Back in the server, outside of the MySQL/MariaDB command line:

  1. Navigate to the directory path the previous query returned. You'll find an Index.xml as well as other XML files.
  2. Follow the first step in the MySQL Documentation

NOTE: Before continuing the second step, open latin1.xml and look closely at the <map> nested in <lower> and <upper>. Find the HEX equivalent character to the one you want to fix, in my case, 2E. We can then map the correct spot in the <map> nested inside <ctype>.

  1. Continue to the second step in the MySQL Documentation
  2. After the changes, Restart your server.

Assign the User-defined Collation to our database/table/column.

All we need to do is assign our collation to our database, table, or column. In my case, I just needed to assign it to two columns, so I ran the following command: ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;

Here are some links that might be helpful: https://mariadb.com/kb/en/setting-character-sets-and-collations/ https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html

This should solve your problem if you don't have any existing data in the table.

If you do have existing data and you try to run the query above, you might have gotten an error similar to the one below:

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x80\x93 fr...' for column.

The issue here is due to attempting to convert a 4byte character into a 3byte character. To solve this, we need to convert our data from 4bytes to binary, then to 3bytes(latin1). For more info, check out this link.

Run the following query in the mysql/mariadb command line: UPDATE table_name SET fulltext_column = CONVERT(CAST(CONVERT(fulltext_column USING utf8) AS BINARY) USING latin1);

You'll need to convert the values of every column which are causing the issue. In my case, it was just one.

Then follow it with: ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;

We are done. We can now search a term with our character, and our database engine will match against it.

Oscar
  • 973
  • 7
  • 10
0

InnoDB solves this problem, MyISAM still persists with this feature/behaviour.

MyISAM works with words like "Node.js" but not with words like "ASP.NET"

The working here

UPDATED: Later I found I might be wrong. MySAM works with the words "Node.js" because at least four characters are required for MySAM while InnoDB requires at least 3 characters.

I found a link here with the below explanation:

Note: Some words are ignored in full-text searches.

  • The minimum length of the word for full-text searches as of follows :
    • Three characters for InnoDB search indexes.
    • Four characters for MyISAM search indexes.
  • Stop words are words that are very common such as 'on', 'the' or 'it', appear in almost every document. These type of words are ignored during searching.
Giacomo1968
  • 25,759
  • 11
  • 71
  • 103
Pradyut Bhattacharya
  • 5,440
  • 13
  • 53
  • 83