3

I'm familiar with the different MySQL collations, my DB is currently set UTF8_general_ci, and I was searching semi-successfully using UTF8_unicode_ci. I'm running mysql 5.6, and I can't use the newest collations, without a headache in upgrading.

I say semi-successfully, as is returning when using UTF8_unicode_ci; I can live with that if I can solve the big issue below.

What I'm trying to achieve is searching my city column and returning a row, so Yokohama and 横浜 and よこはま and ヨコハマ all return the same row in querying the DB. Is this possible?

Penny Liu
  • 15,447
  • 5
  • 79
  • 98
mcgruff14
  • 318
  • 2
  • 15
  • It's been a while, but as far as I remember this is very difficult without storing a kana reading for the kanji. – Robby Cornelissen Jul 24 '18 at 04:15
  • It seems like most of the collations only allow for hiragana/katakana conversions, not to English and not to Kanji (at least without the Kana reading, I read in a blog about that). Is that correct? – mcgruff14 Jul 24 '18 at 04:23
  • Like I said, it's been a couple of years since I worked on this, so I'm a bit out of the loop and don't really know what's available in terms of collations at this point. If you were to implement things yourself, the biggest problem I see are the kanji conversions. To some extent you could probably rely on a kanji dictionary, but for things like personal names it would become very tricky. The romaji conversions should be fairly trivial to implement, although there are a couple of annoying edge cases (e.g. zu → ず or づ). – Robby Cornelissen Jul 24 '18 at 04:33

2 Answers2

1

This is a difficult problem, and I suspect you can't easily solve it just using MySQL.

There's a program called Migemo which deals with this problem to a certain extent in different contexts. It generates a regex to match unconverted input to a document. You can see an online demo of one version with output here. For example, the regex for the input "toukyou" is:

[とトト][うウウ][きキキ][ょョョ][うウウ]|とうきょう|当協会|東京|東教|toukyou|([tţťŧŢŤŦ]|t[¸ˇ-])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))([kķĸǩĶǨ]|k[¸ˇ])([yỳýÿŷỲÝŸŶ]|y[ˋ`ˊ´¨ˆ^])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))

Since Migemo is for matching from partial non-converted input, it expects ASCII input, so you can't use it as-is. However, the basic strategy of Migemo - convert your input to a regex matching all variants - can be used to search your MySQL database.

As a commenter mentioned this is hard if you want to support place names with idiosyncratic romanization; do you need to recognize both "Shinbashi" and "Shimbashi", for example? Even "Tokyo" requires an exception if you're just working from hiragana input. For this I recommend you build your own list, potentially using romaji data from JP Post. It won't deal with all variations but it will get the standard ones.

Hope that helps.

polm23
  • 14,456
  • 7
  • 35
  • 59
1

I used libkakasi in the past, though I collated in application space, not in the database.

› echo -n '横浜' | kakasi -i utf8 -o utf8 -JH
よこはま
› echo -n 'ヨコハマ' | kakasi -i utf8 -o utf8 -KH
よこはま

You could normalise everything to hiragana first, store these strings additionally in the database, perform collation on them with an existing Unicode collation, and when you found a match, fetch the corresponding original unnormalised string instead.

daxim
  • 39,270
  • 4
  • 65
  • 132