5

Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?

In english it can be done by:

SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'

or even a "dirty" solution like:

SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"

Is there a similar solution for eastern languages / CJK characters?

I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.

The words are stored in a utf-8 string field.

If this cannot be done in mysql, can it be done in PHP?

Thanks! :)

edit 1: The data does not include in which language the string is therefore I cannot filter by another field. edit 2: using a translator api like bing's (google is closing their translator api) is an interesting idea but i was hoping for a faster regex-style solution.

DRosenfeld
  • 115
  • 1
  • 9
3seconds
  • 101
  • 1
  • 6
  • 1
    1) Transform your string into raw codepoints (e.g. UCS-4). 2) check each character if it's within your desired range. For CJK glyphs you may be lucky and they actually for one contiguous range (or at least only a handful). – Kerrek SB Jul 06 '11 at 11:24
  • This is similar, but not identical to, http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php – Arafangion Jul 06 '11 at 12:00

4 Answers4

3

Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the mySQL reference for regexp where it states:

Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.

Fortunately in PHP you can build such a regexp e.g. with

/[\x{1234}-\x{5678}]*/u

(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).

borrible
  • 17,120
  • 7
  • 53
  • 75
  • A regular expression alone may prove to be remarkably ineffective given that the characters used are very similar. I think you would need to use, as a minimum, some sort of statistics. – Arafangion Jul 06 '11 at 11:55
  • @Arafangion - Hangul characters are used by Korean only and Katakana characters only for Japanese. Only potential ambiguity if the Chinese characters where, admittedly, some second-order check might be required. – borrible Jul 06 '11 at 12:01
  • @Arafangion - Indeed, but then as mentioned in my response, this enables the picking of the script and not directly the language. It may not be applicable as an entire solution - depending on the (unspecified) nature of the original poster's data and accuracy requirements. – borrible Jul 06 '11 at 12:08
  • You're correct there, so your question may well be the solution that the OP picks, much to my chagrin. – Arafangion Jul 06 '11 at 12:10
  • @borrible - accuracy requirements can be 80%-90%. Can the problem be solved with regex in such accuracy? – 3seconds Jul 06 '11 at 14:02
  • @user831405 - It will depend on your data. If, for example, your Korean text only ever users Hangul then this will give you very accurate results for detecting Korean. – borrible Jul 06 '11 at 14:04
1

You can't do this from the character set alone - especially in modern times where asian texts are frequently "romanized", that is, written with the roman script, that said, if you merely want to select texts that are superficially 'asian', there are ways of doing that depending on just how complicated you want to be and how accurate you need to be.

But honestly, I suggest that you add a new "language" field to your database and ensuring that it's populated correctly.

That said, here are some useful links you may be interested in:

The latter is relatively complex to implement, but yields a much better result.

Alternatively, I believe that google has an (online) API that will allow you to detect, AND translate a language.

An interesting paper that should demonstrate the futility of this excercise is:

Finally, you ask:

If this cant be done in mysql - how can it be done in PHP?

It will likely to be much easier to do this in PHP because you are more able to perform mathematical analysis on the language string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching the results for performance reasons.

Community
  • 1
  • 1
Arafangion
  • 11,517
  • 1
  • 40
  • 72
0

you may consider another data structure that contains the words and or characters, and the language you want to associate them with.

the 'normal' eastern ascii characters will associate to many more languages than just English for instance, just as other characters may associate to more than just Chinese.

Randy
  • 16,480
  • 1
  • 37
  • 55
0

Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.

Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.

Japanese and Chinese both use Han characters though which means the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!

There are some heuristics though.

Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.

Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.

But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.

One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!

And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.

hippietrail
  • 15,848
  • 18
  • 99
  • 158