Validate Japanese Character in Active Record Callback

Question

I'm working on a Japanese project that needs to validate a field containing half-width and full-width Japanese characters: up to 14 characters are allowed when they are half-width, and up to 7 when they are full-width.

Does anyone know how to implement that?

Right now my model has:

class Customer
  validates_length_of :name, :maximum => 14
end

which is not a good choice.

I'm currently using Rails 2.3.5. Both full-width and half-width characters can be used.

Not sure what half width and full width Japanese characters are. Is it something I have to understand to answer the question? — Jason Kim, Mar 26 '13 at 07:04
@garbagecollection Most likely, yes. [This answer seems related](http://stackoverflow.com/a/4684278/477878) even if it is an answer about font width for adjustment. — Joachim Isaksson, Mar 26 '13 at 07:06
Yes, I think it is about the font: half-width has a 2:1 ratio while full-width has a 1:1 ratio. I checked the byte size, but both are 3. — valrecx, Mar 26 '13 at 07:28
Could you edit your question with the answers to these questions: 1) What Ruby/Rails version are you using? 2) What is the database encoding? 3) Is there the chance of having both half and full width characters in the `name`? 4) Are the [dakuten](http://en.wikipedia.org/wiki/Dakuten) of the half-width characters to be considered a separate character with regards to determining string length? 5) Although there's the possibility that half-width characters will be entered by a user, do you actually need to store them, or can you convert them to full-width before saving them to the database? — Paul Fioravanti, Mar 26 '13 at 08:29
I can't say more about the database because I am not allowed to access it. All I can do is code. I'm just starting with Rails; other languages like PHP have functionality to check for this. The requirement is just to save what the user has entered in the text field, full-width or half-width. Half-width is allowed up to 14 characters and full-width only 7 characters. I can't check the byte size since both come out as 3 bytes each. — valrecx, Mar 26 '13 at 09:57
This Japanese character is half-width 'ｺ' while this one is full-width '速', but if I use .length or bytesize, each returns 3 bytes. — valrecx, Mar 26 '13 at 10:00
You will likely get different values for `"ｺ".length` and `"速".length` etc depending on your Ruby version, which is why I asked (are you 1.8.7? 1.9.x?). I ask about database encoding because even though Rails assumes UTF-8, projects I've been involved with in Japan don't, and even insist on using Shift-JIS/EUC-JP/ISO-2022-JP data, which then brings in fun encoding issues. Since you're being asked to actually handle half-width characters in the first place, I wouldn't assume UTF-8. Anyway, if I were you, I'd ask your manager/customer how they feel about Q3-5 above and formulate a strategy. — Paul Fioravanti, Mar 26 '13 at 10:52
Also, if both full and half-width characters can be in the same string (which seems to me very irregular unless I'm reading your question/edit wrong), how can you determine its length to validate against? — Paul Fioravanti, Mar 26 '13 at 10:54
In such a use case, I assume that half-width and full-width refer to half-width kana and full-width kana. For historical reasons, Japanese katakana letters are assigned to two different sets of codes. It is not a matter of font; think of it as two separate a-z sets (you have capital and small letters, which is somewhat similar but not exactly the same). In an older encoding like Shift-JIS or EUC-JP, having different maximum lengths might make some sense because half-width kana take fewer bytes, but if this is UTF-8 and a new system, that specification seems badly made. You could just set, say, a maximum of 14 letters in UTF-8. — akky, Mar 27 '13 at 02:43
And if there is no silly limitation on your database table (perhaps because of a migration from a legacy system), I suggest asking your client whether, instead of assigning different maximum lengths (and adding warnings on the form fields), you can convert all kana to either half-width or full-width (I would choose full-width) and then handle only one variation of kana in the model. That will make for a better user experience. — akky, Mar 27 '13 at 02:52
If the reason for this length restriction has to do with conversion to a system expecting a legacy encoding, such as Shift-JIS, with an arbitrary byte length restriction, I'd recommend a custom validator that simply converts to that encoding and does a count. You can write an additional check that ensures the content is in the appropriate ranges for katakana. — JasonTrue, Mar 27 '13 at 03:25
@Paul Fioravanti, since my client needed it immediately, we both decided to just limit the byte size to 21 on the server side and limit the max length of the text area to 14 characters. Maybe this is not the best solution, but he accepted it. The Moji library is worth trying. Thanks to all. — valrecx, Apr 07 '13 at 09:15
@valrecx, thanks for following up! If your requirements end up being refined or changed, please update this thread as I know I'm not the only one interested in how stuff like this gets implemented in Rails in real world Japan. — Paul Fioravanti, Apr 07 '13 at 09:37
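
The byte-count idea raised in the comments above (convert to the legacy encoding, then count) can be sketched in plain Ruby. This is only a minimal sketch: it assumes Ruby 1.9 or later (String#encode is not available on 1.8.7), assumes the target encoding is Shift_JIS, and the helper names legacy_bytesize and fits_legacy_limit? are made up for illustration.

```ruby
# Hypothetical helpers; Shift_JIS as the target encoding is an assumption.
# Note: String#encode raises Encoding::UndefinedConversionError for
# characters that have no Shift_JIS representation.
def legacy_bytesize(text, encoding = "Shift_JIS")
  text.encode(encoding).bytesize
end

# 14 bytes allows 14 half-width kana or 7 double-byte characters.
def fits_legacy_limit?(text, max_bytes = 14)
  legacy_bytesize(text) <= max_bytes
end

puts "\uFF7A".bytesize          # UTF-8 size of the half-width kana
puts "\u901F".bytesize          # UTF-8 size of the kanji
puts legacy_bytesize("\uFF7A")  # Shift_JIS size of the half-width kana
puts legacy_bytesize("\u901F")  # Shift_JIS size of the kanji
```

In UTF-8 both sample characters occupy three bytes, which is why bytesize could not tell them apart; in Shift_JIS the half-width kana takes one byte and the kanji two, so a single byte limit covers both cases.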

jogojapan · Answer 1 · 2013-03-27T03:26:15.733

First of all, the concept of fullwidth (全角) and halfwidth (半角) exists only for two types of characters in Japanese:

Roman characters (i.e. Latin)
Katakana characters

A similar concept exists for Korean Hangul, but not for Japanese Hiragana, nor for Kanji.

For Katakana, half-width characters have their own Unicode code points, and they are rendered at half the width of full-width characters, although they are otherwise identical in shape. Example:

Fullwidth "ka": カ
Halfwidth "ka": ｶ

Combined characters (i.e. with diacritics, like ガ) do not exist in halfwidth versions; they must be encoded as two separate characters: ｶ + ﾞ, which is probably the reason why in your task twice as many characters are allowed for halfwidth. (Note that these combinations of two code points are regarded as combining characters and usually rendered as one.)
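
To make the two-code-point point concrete, here is a small sketch. It assumes Ruby 1.9 or later for character-aware String#length, and Ruby 2.2 or later for String#unicode_normalize; the constant names are only for illustration.

```ruby
# Illustrative constants, written as escapes so the code points are explicit.
FULLWIDTH_GA = "\u30AC"        # fullwidth "ga" as one precomposed code point
HALFWIDTH_GA = "\uFF76\uFF9E"  # halfwidth "ka" followed by the halfwidth voiced mark

puts FULLWIDTH_GA.length   # 1
puts HALFWIDTH_GA.length   # 2
puts HALFWIDTH_GA.codepoints.map { |cp| cp.to_s(16) }.join(" ")  # ff76 ff9e

# NFKC normalization folds the halfwidth pair into the single fullwidth character.
puts HALFWIDTH_GA.unicode_normalize(:nfkc) == FULLWIDTH_GA  # true
```

So a name typed in halfwidth katakana can need up to twice as many code points as the same name in fullwidth.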

For Roman (Latin) characters, the usual ASCII characters are called halfwidth, but the Japanese code range of Unicode (as well as traditional Japan-specific character sets) provides a separate code range for fullwidth versions. Example:

Fullwidth: Ｌ
Halfwidth: L

Fullwidth versions do not exist for non-ASCII Latin-derived characters (such as German umlauts), nor for accented versions. They do, however, exist for numerals and some punctuation characters.

Again, Hiragana and Kanji have no halfwidth versions.

To check whether a character is a fullwidth or halfwidth character, compare the code point to the relevant code range. The ranges are as follows:

Halfwidth Katakana: 0xff61 through 0xff9f
Fullwidth Katakana: 0x30a0 through 0x30ff
Halfwidth Roman: 0x21 through 0x7e (this is ASCII)
Fullwidth Roman: 0xff01 through 0xff60
Hiragana: 0x3041 through 0x309f
Kanji (i.e. the unified-ideographs range): 0x4e00 through 0x9fcc

Here is a simple Ruby program that performs the checks on a per-character basis:

# -*- coding: utf-8 -*-

def is_halfwidth_katakana(c)
  return (c.ord >= 0xff61 and c.ord <= 0xff9f)
end

def is_fullwidth_katakana(c)
  return (c.ord >= 0x30a0 and c.ord <= 0x30ff)
end

def is_halfwidth_roman(c)
  return (c.ord >= 0x21 and c.ord <= 0x7e)
end

def is_fullwidth_roman(c)
  return (c.ord >= 0xff01 and c.ord <= 0xff60)
end

def is_hiragana(c)
  return (c.ord >= 0x3041 and c.ord <= 0x309f)
end

def is_kanji(c)
  return (c.ord >= 0x4e00 and c.ord <= 0x9fcc)
end

text = "Hello World、こんにちは、半角ｶﾀｶﾅ、全角カタカナ、ｆｕｌｌｗｉｄｔｈ ０－９\n"

text.split("").each do |c|
  if is_halfwidth_katakana(c)
    type = "halfwidth katakana"
  elsif is_fullwidth_katakana(c)
    type = "fullwidth katakana"
  elsif is_halfwidth_roman(c)
    type = "halfwidth roman"
  elsif is_fullwidth_roman(c)
    type = "fullwidth roman"
  elsif is_hiragana(c)
    type = "hiragana"
  elsif is_kanji(c)
    type = "kanji"
  end

  printf("%c (%x) %s\n",c,c.ord,type)
end
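
Building on the per-character checks, one possible way to handle mixed input is to count "width units" instead of characters: 1 for each halfwidth character and 2 for anything else, with a limit of 14 units. This is only a sketch of that idea, not a confirmed requirement from the question. The names HALFWIDTH_RANGES, width_units and name_fits? are hypothetical, and treating the ASCII space (0x20) as halfwidth is an added assumption.

```ruby
# Hypothetical helper constants and methods for illustration.
# 0x20 (space) is included as an assumption; the other bounds come from
# the halfwidth Roman and halfwidth Katakana ranges listed above.
HALFWIDTH_RANGES = [0x20..0x7e, 0xff61..0xff9f].freeze

# Counts 1 unit per halfwidth character and 2 units per any other character.
def width_units(text)
  text.each_char.inject(0) do |total, c|
    halfwidth = HALFWIDTH_RANGES.any? { |range| range.include?(c.ord) }
    total + (halfwidth ? 1 : 2)
  end
end

# True when the text fits within the given number of width units.
def name_fits?(name, max_units = 14)
  width_units(name) <= max_units
end

puts width_units("\uFF76\uFF80\uFF76\uFF85")  # four halfwidth katakana
puts width_units("\u30AB\u30BF\u30AB\u30CA")  # four fullwidth katakana
```

With this rule, 14 halfwidth characters and 7 fullwidth characters both reach exactly 14 units, and a mixed string is measured consistently.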

Further notes

The code ranges above are the official Unicode ranges for each character type (see Unicode Fullwidth forms and Unicode Hiragana). These include certain fullwidth / halfwidth versions of characters that are old / traditional forms or special punctuation characters. If you only want characters that are commonly used in web forms (e.g. for people to enter their names), you might want to narrow the ranges a bit.
Recommendation: If this is for a web form where people can enter their names, you might want to do a little more than just check for half-width or full-width. It is extremely common on Japanese websites and registration forms, esp. with banks, to require that people enter their name in pure halfwidth (typically for Latin) or pure fullwidth (typically for Katakana). Unfortunately, this makes entering data very inconvenient. When the Japanese input method is enabled, Latin characters often come out in fullwidth versions, and the web form will then reject the data because it isn't pure halfwidth. Rather than rejecting it, it should automatically convert it to whatever form it needs. You can easily implement this by translating from one code range to the other (simply by adding the relevant constant), and make people's lives much easier.
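
As a sketch of the conversion idea for Roman characters: the fullwidth forms U+FF01 through U+FF5E sit at a fixed distance of 0xFEE0 from ASCII 0x21 through 0x7E, so subtracting that constant gives the halfwidth version. The names FULLWIDTH_OFFSET and fullwidth_roman_to_halfwidth are hypothetical, and Ruby 1.9 or later is assumed. Note that halfwidth and fullwidth Katakana are not laid out in the same order, so a single constant does not cover them; a lookup table or Unicode NFKC normalization would be needed there.

```ruby
# Hypothetical helper; covers fullwidth Roman letters, digits and
# punctuation only (U+FF01..U+FF5E), not Katakana.
FULLWIDTH_OFFSET = 0xfee0  # 0xff01 - 0x21

def fullwidth_roman_to_halfwidth(text)
  text.gsub(/[\uFF01-\uFF5E]/) do |c|
    (c.ord - FULLWIDTH_OFFSET).chr(Encoding::UTF_8)
  end
end

puts fullwidth_roman_to_halfwidth("\uFF2C")                    # fullwidth L
puts fullwidth_roman_to_halfwidth("\uFF46\uFF55\uFF4C\uFF4C")  # fullwidth "full"
```

Characters outside the fullwidth Roman range, such as Katakana or Kanji, pass through unchanged.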

score 5 · Accepted Answer · answered Mar 27 '13 at 07:21

The following code may just push you over the line to fulfil the exact requirement you've so far specified in the least possible time. It uses the Moji gem (Japanese documentation), which gives lots of convenience methods in determining the content of a Japanese language string.

It validates a maximum of 14 characters in a name that only consists of half-width characters, and a maximum of 7 characters for names otherwise (including names that contain a combination of half- and full-width characters i.e. the presence of even one full-width character in the string will make the whole string be regarded as "full-width").

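class Customer

  # Names consisting only of half-width characters may be up to 14 characters long.
  validates_length_of :name, :maximum => 14,
    :if => Proc.new { |customer| customer.half_width?(customer.name) }
  # Any other name (including mixed half- and full-width) is limited to 7 characters.
  validates_length_of :name, :maximum => 7,
    :unless => Proc.new { |customer| customer.half_width?(customer.name) }

  def half_width?(string)
    Moji.type?(string, Moji::HAN_KATA)
  end

end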

Assumptions made:

Data encoding within the system is UTF-8, and gets stored as such in the database; any further necessary re-encoding (such as for passing the data to a legacy system etc) is done in another module.
No automatic conversion of half-width to full-width characters is done before data is saved to the database, i.e. half-width characters are allowed in the database for reasons perhaps of legacy system integration, proper preservation of actual user input(!), and/or aesthetic value of half-width characters(!)
Diacritics in half-width characters are treated as their own separate character (i.e. no parsing of ｶ and ﾞ to be considered one character for purposes of determining string length)
There is only one name field as you specify and not, say, four (for surname, surname furigana, given name, given name furigana) which is quite common nowadays.
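
For reference, the rule described above (14 characters if the name is entirely half-width, 7 otherwise) can also be sketched in plain Ruby without the Moji gem. This is an illustration under assumptions, not the gem's behaviour: the names HALFWIDTH_ONLY, max_name_length and valid_name_length? are hypothetical, and counting ASCII (including the space) as half-width alongside half-width katakana is an assumption.

```ruby
# Hypothetical sketch; which characters count as half-width is an assumption
# (ASCII 0x20..0x7E plus half-width katakana U+FF61..U+FF9F).
HALFWIDTH_ONLY = /\A[\u0020-\u007E\uFF61-\uFF9F]*\z/

# 14 when every character is half-width, otherwise 7.
def max_name_length(name)
  (name =~ HALFWIDTH_ONLY) ? 14 : 7
end

def valid_name_length?(name)
  name.length <= max_name_length(name)
end

puts max_name_length("\uFF76\uFF80\uFF76\uFF85")  # all half-width katakana
puts max_name_length("\uFF76\u30AB")              # mixed half- and full-width
```

As in the validations above, a single full-width character is enough to put the whole name under the 7-character limit.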
