Validating Kana Input

Question

I am working on an application that lets users enter Japanese characters. I am trying to find a way to determine which Japanese script the user's input belongs to: hiragana, katakana, or kanji.

Some fields should not accept Latin text, and I need a way to restrict particular fields to kanji only, katakana only, and so on.

The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.

Ideas?

score 6 · Answer 1 · answered Dec 23 '08 at 07:39

I'm not sure there is a perfect answer, but Wikipedia lists the Unicode ranges for hiragana and katakana. (I would expect them to be available from unicode.org as well.)

Hiragana: U+3040–U+309F
Katakana: U+30A0–U+30FF

Checking each input character against those ranges should work as validation for hiragana or katakana, and the check is language-agnostic as long as the language exposes Unicode code points.
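As a rough illustration of the range check described above, here is a minimal Python sketch. Python is chosen only because the question is language-agnostic, and the function names are made up for this example.

```python
# Unicode block boundaries quoted in the answer above.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)


def in_block(ch, block):
    """Return True if the single character ch falls inside the block."""
    lo, hi = block
    return lo <= ord(ch) <= hi


def is_all_hiragana(text):
    """True only for non-empty text made up entirely of hiragana-block characters."""
    return bool(text) and all(in_block(ch, HIRAGANA) for ch in text)


def is_all_katakana(text):
    """True only for non-empty text made up entirely of katakana-block characters."""
    return bool(text) and all(in_block(ch, KATAKANA) for ch in text)
```

Two caveats with this sketch: the blocks also contain a few marks and unassigned code points, so the check is looser than a strict list of letters, and the long-vowel mark ー (U+30FC) sits in the katakana block, so hiragana text that uses it would be rejected by is_all_hiragana.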

For kanji, I would expect it to be a little more complicated, as I expect the Chinese characters used in Chinese and in Japanese share the same range, but then again, I may be wrong here. (I can't imagine Simplified Chinese and Traditional Chinese being included in the same range...)
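For what it's worth, the main block is CJK Unified Ideographs, U+4E00–U+9FFF, and it is shared: under Han unification, Japanese kanji and both Traditional and Simplified Chinese characters live in the same block. A hedged sketch with hypothetical function names, which accepts any ideograph in that block and so cannot tell Japanese from Chinese:

```python
# Main CJK Unified Ideographs block; shared by Japanese and Chinese text.
CJK_UNIFIED = (0x4E00, 0x9FFF)


def is_cjk_ideograph(ch):
    """True if the single character ch lies in the main CJK block."""
    return CJK_UNIFIED[0] <= ord(ch) <= CJK_UNIFIED[1]


def is_all_kanji(text):
    """True only for non-empty text drawn entirely from that block."""
    return bool(text) and all(is_cjk_ideograph(ch) for ch in text)
```

Both is_all_kanji("漢字") and is_all_kanji("汉字") (the Simplified form) pass this check. It does not cover the extension blocks (for example Extension A at U+3400–U+4DBF) or the iteration mark 々 (U+3005), so whether those should be accepted is a decision left to the application.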

score 6 · Accepted Answer · answered Dec 23 '08 at 07:40

It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
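To make the per-field restriction concrete, here is a small Python sketch that reports which characters fall outside the ranges a field allows. It is not the MiscUtil API; the table and function names are hypothetical, and the bounds are simply the Unicode block ranges discussed in this thread.

```python
# Hypothetical table of named ranges (inclusive code point bounds).
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),
}


def invalid_chars(text, allowed):
    """Return the characters of text that fall outside every allowed range."""
    ranges = [SCRIPT_RANGES[name] for name in allowed]
    return [
        ch for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in ranges)
    ]
```

A katakana-only field could then call invalid_chars(value, ["katakana"]) and treat an empty list as valid. Returning the offending characters rather than a plain boolean makes it easier to show the user what to fix.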

If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.

Jon Skeet

Jon, you wouldn't happen to have the source handy, would you? – Zack The Human Nov 26 '09 at 04:50
@Zack: Follow the link and you can download it :) – Jon Skeet Nov 26 '09 at 07:19

score 2 · Assembler · Answer 3 · 2009-04-24T07:12:54.810

Oh oh! I had this one once... I had a regex with the hiragana, then the katakana, and then the kanji. I forget the exact codes; I'll go have a look.

Regex is great because you double the problems. And I did it in PHP, my choice for extra-strong automatic problem generation.

--edit--

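// Matches any run of characters outside the allowed set: word characters (\w),
// hiragana, katakana, the long-vowel mark, kanji in U+4E00-U+9FAF, underscore, and hyphen.
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';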

I found this here, but it's not great... I'll keep looking

--edit-- I looked through my portable hard drive... I thought I had kept that particular snippet from my last company... sorry.
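For comparison, here is a hedged sketch of the regex approach in Python rather than PHP. It is written as an allowlist with explicit code point escapes, so no Japanese literals appear in the source. The pattern and function names are made up, and the kanji upper bound is U+9FFF, the end of the CJK Unified Ideographs block.

```python
import re

# Allowlist patterns built from Unicode block ranges (hypothetical names).
HIRAGANA_ONLY = re.compile(r"[\u3040-\u309F]+")
KATAKANA_ONLY = re.compile(r"[\u30A0-\u30FF]+")
KANJI_ONLY = re.compile(r"[\u4E00-\u9FFF]+")


def matches_fully(pattern, text):
    """True if the whole string consists of characters the pattern allows."""
    return pattern.fullmatch(text) is not None
```

Using fullmatch means the whole string has to match, so a single stray Latin letter makes the check fail, and an empty string fails as well because the patterns require at least one character.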

edited Apr 24 '09 at 07:12

answered Apr 24 '09 at 06:49

Assembler

I used to use the same range for kanji (4E00~9FAF), but I checked the Unicode charts and found that the full range is a bit larger: 4E00~9FFF. It probably contains characters that are not used (anymore?) in Japanese, though. – d-_-b Nov 03 '10 at 04:54
Writing Japanese characters in the source file is bad practice. – zawhtut Jan 10 '13 at 04:21
