7

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana, and katakana characters? I am writing a Java function and a JavaScript function that determine whether a character is Japanese. What is the range of these characters in ASCII?

dda
cedric

7 Answers

16

ASCII stands for American Standard Code for Information Interchange. It includes only 128 characters (not all of them even printable) and is based on the needs of American English use circa 1960. It includes nothing related to any Japanese characters.

I believe you want the Unicode code points for these characters, which you can look up in the code charts provided by unicode.org.

7

Please see my similar question regarding Kanji/Kana characters. As @coobird mentions it may be tricky to decide what range you want to check against since many Kanji overlap with Chinese characters.

In short, the Unicode ranges for hiragana and katakana are:

  • Hiragana: U+3040–U+309F
  • Katakana: U+30A0–U+30FF
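The two block ranges above translate directly into simple range checks. Here is a minimal Java sketch; the class and method names (`KanaCheck`, `isHiragana`, `isKatakana`) are illustrative, not from the question:

```java
// Minimal range checks for the hiragana and katakana Unicode blocks.
public class KanaCheck {
    // U+3040–U+309F is the Hiragana block.
    static boolean isHiragana(char c) {
        return c >= '\u3040' && c <= '\u309F';
    }

    // U+30A0–U+30FF is the Katakana block.
    static boolean isKatakana(char c) {
        return c >= '\u30A0' && c <= '\u30FF';
    }

    public static void main(String[] args) {
        System.out.println(isHiragana('あ')); // true
        System.out.println(isKatakana('ア')); // true
        System.out.println(isHiragana('A'));  // false
    }
}
```

The same comparisons work unchanged in JavaScript with `charCodeAt`, since both languages expose UTF-16 code units.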

If you find this answer useful please upvote @coobird's answer to my question as well.

がんばって! (Good luck!)

Zack The Human
3

Well, it has been a while, but here's a link to tables of hiragana, katakana, kanji, etc. and their Unicode code points...

http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

BUT, as you probably know, Unicode code points are conventionally written in hexadecimal. You can convert them to decimal numbers using Windows Calculator in programmer mode and then enter that number as an Alt code, and it will produce the character you want, depending on what you're typing into. It will in MS WordPad and Word (not Notepad).

For example, the hiragana ぁ is U+3041. Hexadecimal 3041 is 12353 in decimal. If you enter 12353 as an Alt code in WordPad or Word, i.e. hold Alt, type 12353 on the number pad, then release Alt, it will print ぁ. The main Japanese ranges are hiragana U+3040–U+309F (12352–12447 decimal), katakana U+30A0–U+30FF (12448–12543 decimal), and the CJK Unified Ideographs kanji block U+4E00–U+9FFF (19968–40959 decimal), so there are several ranges. There's also a half-width katakana range on that chart.
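The hex-to-decimal conversion described above can also be done programmatically rather than with Windows Calc. A small Java sketch (the class name `HexToDecimal` is illustrative):

```java
// Convert a hexadecimal code point to decimal and back to a character.
public class HexToDecimal {
    public static void main(String[] args) {
        // Parse "3041" as a base-16 number.
        int cp = Integer.parseInt("3041", 16);
        System.out.println(cp);                                 // 12353

        // Turn the code point back into the character it names.
        System.out.println(new String(Character.toChars(cp)));  // ぁ
    }
}
```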

  • 2
    "Unicodes are hexadecimal". Um. This is a completely nonsensical statement. Code points are just numbers; hexadecimal is just a way of writing numbers. I'm sure I can find a unicode listing in decimal somewhere on the web. – Nyerguds Apr 12 '17 at 07:27
2

Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?

Noon Silk
0

I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.

dda
  • An interesting related question would be "is there an 8-bit extended ASCII encoding for Japanese?", though ;) – Nyerguds Apr 12 '17 at 07:31
0

Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana will have a "Script" property of Kana. In Java, you can determine a character's "Script" property using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html. I don't know if you can determine a character's "Script" property in JavaScript.
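A minimal Java sketch of the Script-property approach just described; the helper name `isJapaneseScript` is illustrative:

```java
// Classify a code point by its Unicode Script property (Java 7+).
public class ScriptCheck {
    static boolean isJapaneseScript(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            // HAN covers kanji, but it is shared with Chinese and Korean text.
            || script == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(isJapaneseScript('ね')); // true (Hiragana)
        System.out.println(isJapaneseScript('猫')); // true (Han)
        System.out.println(isJapaneseScript('x'));  // false (Latin)
    }
}
```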

Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.

Tanner Swett
-2

I think what you mean by "ASCII code for Japanese" is an SBCS (single-byte character set) equivalent for Japanese. For Japanese you only have MBCS (multi-byte character sets), which combine single-byte and multi-byte characters. So in a Japanese text file saved in an MBCS, non-Japanese characters (English letters, numbers, and common non-alphanumeric characters) are saved as one byte each, and Japanese characters are saved as two bytes.

This assumes you are not referring to Unicode, which is a uniform DBCS (double-byte character set) where each character is exactly two bytes. Actually, to be more correct, Unicode lately also goes beyond two bytes because the character set could not accommodate any more characters; some Unicode characters already consist of four bytes, with the first two bytes acting as a lead.

If you are referring to the first one (MBCS) and not Unicode, then there are several Japanese character sets, with Shift-JIS being the most popular. So I suggest that you search for a Shift-JIS character map, although there are other Japanese character set maps aside from Shift-JIS.
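The one-byte/two-byte split described above can be observed in Java by encoding text as Shift-JIS. A small sketch, assuming the JRE ships the Shift_JIS charset (it is an extended, not guaranteed, charset; the class name `SjisBytes` is illustrative):

```java
import java.nio.charset.Charset;

// Show that ASCII letters take one byte and kana take two in Shift-JIS.
public class SjisBytes {
    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println("A".getBytes(sjis).length);  // 1
        System.out.println("あ".getBytes(sjis).length); // 2
    }
}
```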

Nap
  • 7
    Unicode is *not* a "double byte character set". Do not confuse encodings with the character set itself. The Unicode standard provides, among other things, a mapping between characters and numbers ('code points'). When you talk about a "two byte Unicode", you are probably referring to UCS-2 (two bytes per code point, cannot represent all Unicode characters) or UTF-16 (two or four bytes per code point). Other encodings include UTF-32 (a four-byte encoding) and UTF-8 (an encoding that uses one, two, three, or four bytes per code point). – Chris Johnsen Nov 26 '09 at 05:26