Searching Geoname database with non-latin characters

Question

I have a copy of the Geonames database stored in a MySQL database, and a PHP application that allows users to search the database for their city. It works fine if they type the city name in English, but I want them to be able to search in their native language.

For example, instead of asking a Japanese speaker to search for Tokyo, they should be able to search for 東京.

The Geonames database contains an alternatenames column with, "alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)."

For example, the alternatenames value for the Tokyo row is Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,TokiÃ³,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - æ±äº¬,TÃ²quio,TÃ³kÃ½Ã³,TÃ³quio,TÅkyÅ,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Î¤ÏŒÎºÎ¹Î¿,Ð¢Ð¾ÐºÐ¸Ð¾,Ð¢Ð¾ÐºÑ‘,Ð¢Ð¾ÐºÑ–Ð¾,ÕÕ¸Õ¯Õ«Õ¸,×˜×•×§×™×•,ØªÙˆÙƒÙŠÙˆ,ØªÙˆÚ©ÛŒÙˆ,Ø·ÙˆÙƒÙŠÙˆ,Ü›Ü˜ÜŸÜÜ˜,ÜœÜ˜ÜŸÜÜ˜,à¤Ÿà¥‹à¤•à¥à¤¯à¥‹,à®Ÿà¯‹à®•à¯à®•à®¿à®¯à¯‹,à¹‚à¸•à¹€à¸à¸µà¸¢à¸§,áƒ¢áƒáƒ™áƒ˜áƒ,ä¸œäº¬,æ±äº¬,æ±äº¬éƒ½,ë„ì¿„.

Those values don't contain 東京 exactly, but I'm guessing that they contain a form of it that has been encoded or converted in some way. So, I assuming that if I perform the same encoding/conversion on my search string, then I'll be able to match the row. For example:

mysql_query( sprintf( "
    SELECT * FROM geoname 
    WHERE 
        MATCH( name, asciiname, alternatenames ) 
        AGAINST ( %s )  
    LIMIT 1",
    iconv( 'UTF-8', 'ASCII', '東京' )
) );

The problem is that I don't know what that conversion would be. I've tried lots of combinations of iconv(), mb_convert_string(), etc, but with no luck.

The MySQL table looks like this:

CREATE TABLE `geoname` (
 `geonameid` int(11) NOT NULL DEFAULT '0',
 `name` varchar(200) DEFAULT NULL,
 `asciiname` varchar(200) DEFAULT NULL,
 `alternatenames` mediumtext,
 `latitude` decimal(10,7) DEFAULT NULL,
 `longitude` decimal(10,7) DEFAULT NULL,
 `fclass` char(1) DEFAULT NULL,
 `fcode` varchar(10) DEFAULT NULL,
 `country` varchar(2) DEFAULT NULL,
 `cc2` varchar(60) DEFAULT NULL,
 `admin1` varchar(20) DEFAULT NULL,
 `admin2` varchar(80) DEFAULT NULL,
 `admin3` varchar(20) DEFAULT NULL,
 `admin4` varchar(20) DEFAULT NULL,
 `population` int(11) DEFAULT NULL,
 `elevation` int(11) DEFAULT NULL,
 `gtopo30` int(11) DEFAULT NULL,
 `timezone` varchar(40) DEFAULT NULL,
 `moddate` date DEFAULT NULL,
 PRIMARY KEY (`geonameid`),
 KEY `timezone` (`timezone`),
 FULLTEXT KEY `namesearch` (`name`,`asciiname`,`alternatenames`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4

Can anyone point me in the right direction?

And stop using those outdated, unmaintained, deprecated, and insecure [mysql_* functions](http://stackoverflow.com/questions/12859942/why-shouldnt-i-use-mysql-functions-in-php). — miken32, Mar 03 '17 at 00:17
@miken32: I'm not, it was just a minimized example for simplicity. — Ian Dunn, Mar 03 '17 at 00:33
That looks precisely like UTF8 being interpreted as ISO8859-1. See @miken32's answer. — Sammitch, Mar 03 '17 at 02:04
@Sammitch: Yep. That's exactly what it looks like, somewhere along the line. From the question, we can't tell *where* exactly that happened. Could be everything was perfect up to the copy-paste into StackOverflow, But my suspicion is that the mojibake happened somewhere earlier, likely when loading the data into the table. — spencer7593, Mar 03 '17 at 02:24

miken32 · Accepted Answer · 2017-09-13T18:39:03.497

3

When I download the Japan file and set up a database like this:

CREATE TABLE geonames (
    geonameid SERIAL,
    name varchar(200),
    asciiname varchar(200),
    alternatenames varchar(10000),
    latitude float,
    longitude float,
    featureclass varchar(1),
    featurecode varchar(10),
    countrycode varchar(2),
    cc2 varchar(200),
    admin1code varchar(20),
    admin2code varchar(80),
    admin3code varchar(20),
    admin4code varchar(20),
    population BIGINT,
    elevation INT,
    dem INT,
    timezone varchar(40),
    modificationdate DATE
    ) CHARSET utf8mb4;

Then I load the data like this:

LOAD DATA INFILE '/tmp/JP.txt' INTO TABLE geonames CHARACTER SET utf8mb4;

And select it like this:

SELECT alternatenames FROM geonames WHERE geonameid=1850147\G

I get this:

*************************** 1. row ***************************
alternatenames: Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - 東京,Tòquio,Tókýó,Tóquio,Tōkyō,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,Տոկիո,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟܝܘ,ܜܘܟܝܘ,टोक्यो,டோக்கியோ,โตเกียว,ტოკიო,东京,東京,東京都,도쿄

I can also do a search like this:

SELECT name FROM geonames WHERE alternatenames LIKE '%,東京,%';

Which is a long way of saying: Note the charset declaration when I created the table. I believe this is what you failed to do when you created your database.

edited Sep 13 '17 at 18:39

answered Mar 03 '17 at 00:35

miken32

42,008
16
111
154

1

+10. We note that it's possible to create the table with a "correct" characterset, and still wind up with content mojibake. With the `LOAD DATA INFILE`, MySQL Server uses the characterset of the `character_set_database` system variable, when the characterset is not specified in the `LOAD DATA INFILE` statement. – spencer7593 Mar 03 '17 at 00:59
1

`character_set_database` is based on the currently selected database, so it *should* be okay unless you do `LOAD...INTO otherdb.geonames`. Regardless, it doesn't hurt, and I've edited the question to explicitly include it in the statement. Thanks for the tip. – miken32 Mar 03 '17 at 01:57
1

I'm wondering... does the FULLTEXT search also return the expected rows, using `MATCH` as in OP query? (I'm not going to test it here, I've not setup a demonstration.) MySQL Reference Manual notes a limitation with "built-in full-text parser" related to "ideographic languages such as Chinese and Japanese". (I don't suspect, and I'm not suggesting that the full-text parser is the problem. I'm just curious if that could be an additional issue. It seems like the comma characters would be sufficient "stops" between words. Just curious to confirm if that's the case but not enough to test myself) – spencer7593 Mar 03 '17 at 02:17
1

It does not. Adding fulltext index, I don't get any results searching for Japanese characters. – miken32 Mar 03 '17 at 02:18
1

The failure of a FULLTEXT search to find a match could also be due to a stop word (but very unlikely that **"東京"** would be in the stop word list), or (much more likely), the number of characters is less than the minimum size of a token `innodb_ft_min_token_size`, `ft_min_word_len`. – spencer7593 Mar 03 '17 at 02:31
The charset was the problem, thanks! I ran `ALTER TABLE geoname CONVERT TO CHARACTER SET utf8mb4;` and that fixed it. – Ian Dunn Mar 03 '17 at 18:46
The search doesn't work for me either, but I confirmed the data is correct now with a `LIKE` query. I'm working on the search problem next. – Ian Dunn Mar 03 '17 at 18:49
1

@IanDunn based on the link in spencer7593's answer it looks like FT search *may* work if you have MySQL 5.7.6 or later. I'm on MariaDB 10.0 so can't test that. I did confirm that my `ft_min_word_len` was set to 2, so that wasn't the problem. Glad I could help get you most of the way there. – miken32 Mar 03 '17 at 19:06
1

I did end up using a `LIKE` query as a fallback: https://meta.trac.wordpress.org/changeset/5277/ . I restricted the searched rows by `country` (which is indexed) in order to make it performant. If you're wondering why I didn't upgrade to MySQL 5.7.6+, don't ask :) – Ian Dunn Apr 07 '17 at 05:27

spencer7593 · Answer 2 · 2017-03-03T00:33:29.283

Recommended reading:

https://www.joelonsoftware.com/articles/Unicode.html

http://kunststube.net/encoding/

In terms of MySQL, what is of critical importance is the characterset of the MySQL connection. That's the characterset that MySQL Server thinks the client is using in its communication.

SHOW VARIABLES LIKE '%characterset%'

If that isn't set right, for example, the client is sending latin1 (ISO-8859-1), but MySQL server thinks it's receiving UTF8, or vice versa, there's potential for mojibake.

Also of importance is the characterset of the alternatenames column.

One issue dealing with multibyte character set is going to be the PHP sprintf function. Many of the string handling functions in PHP have "mutlibyte" equivalents , that correctly handle strings containing multibyte characters.

https://secure.php.net/manual/en/book.mbstring.php

Unfortunately, there is no builtin mb_sprintf function.

For a more detailed description of string handling in PHP including multibyte characters/charactersets:

https://secure.php.net/manual/en/language.types.string.php#language.types.string.details

excerpt:

Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data and using instead the functions that do behave correctly, generally from the intl and mbstring extensions. However, using functions that can handle Unicode encodings is just the beginning. No matter the functions the language provides, it is essential to know the Unicode specification.

Also, a google search of "utf8 all the way through" may return some helpful notes. But be aware that this mantra is not a silver bullet or panacea to the issues.

Another possible issue, noted in the MySQL Reference Manual:

https://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

13.9.5 Full-Text Restrictions

Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the built-in full-text parser cannot determine where words begin and end in these and other such languages.

In MySQL 5.7.6, a character-based ngram full-text parser that supports Chinese, Japanese, and Korean (CJK), and a word-based MeCab parser plugin that supports Japanese are provided for use with InnoDB and MySIAM tables.

Searching Geoname database with non-latin characters

2 Answers2