How to distinguish whether a WCHAR is Chinese, Japanese or ASCII?

Question

For example, in Delphi code:

  var
    wchar_IsASCii: array[0..1] of WCHAR;

  wchar_IsASCii[0] := 'A';
  wchar_IsASCii[1] := 'じ';

How can I tell that wchar_IsASCii[0] belongs to ASCII and that wchar_IsASCii[1] does not?

Actually, I only need to know whether a Unicode character belongs to ASCII; that's all.

In short, you can't. You need extra information to be able to determine what language a given character comes from. See [Mojibake 文字化け](http://en.wikipedia.org/wiki/Mojibake) — Leonardo Herrera, Apr 17 '13 at 19:13
...but, it seems that you want to determine if a character falls outside the ASCII range. That's just asking if the value of that char is greater than 127, isn't it? In any case, it seems that you should read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) — Leonardo Herrera, Apr 17 '13 at 19:15

score 1 · Answer 1 · edited May 23 '17 at 11:57

I don't know Delphi, but what I can tell you is that you need to determine which Unicode range the character falls into. Here is a link about finding CJK characters in Unicode: "What's the complete range for Chinese characters in Unicode?"

Unless Delphi has some nice library for distinguishing Chinese and Japanese characters, you're going to have to determine that yourself. There is a good answer here on SO on how to do that: "Testing for Japanese/Chinese Characters in a string"

score 1 · Answer 2 · edited May 23 '17 at 12:18

The problem is: what do you mean by ASCII? The original ASCII standard is a 7-bit code; it does not even fill a byte.

Then, if you bring in so-called "extended ASCII" (1-byte items), its upper half can be next to anything. It can be Greek on one machine, European diacritics on another, Cyrillic on a third, and so on.

So I think that if all you need is to test whether you have a 7-bit ASCII character (ruling out the extended characters of the French, German, Spanish and all the Scandinavian alphabets), then, since Unicode was designed as yet another superset of ASCII, what you need is to check that `(0 <= Ord(char-var)) and ($7f >= Ord(char-var))`.
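A minimal Delphi sketch of that range check. `IsAsciiChar` is a hypothetical helper name made up for this illustration, not an RTL routine. Since `Ord` of a `WideChar` is never negative, only the upper bound needs testing.

```pascal
// Hypothetical helper (not part of the RTL): True when the UTF-16 code unit
// lies in the 7-bit ASCII range U+0000..U+007F.
function IsAsciiChar(C: WideChar): Boolean;
begin
  Result := Ord(C) <= $7F;
end;
```

With the array from the question, `IsAsciiChar(wchar_IsASCii[0])` gives True for 'A', and `IsAsciiChar(wchar_IsASCii[1])` gives False for 'じ' (U+3058).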

However, if you really need to tell languages apart (if you consider Greek and Cyrillic somewhat ASCII but the Japanese kana scripts not, of which there are two, by the way: Hiragana and Katakana; or if you consider French and German more or less ASCII-like, but Russian not), you would have to look at the Unicode ranges:
http://www.unicode.org/charts/index.html
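Going by those charts, a rough classification by block could be sketched as below. The type and function names are hypothetical, and only three blocks are covered: Hiragana (U+3040..U+309F), Katakana (U+30A0..U+30FF) and CJK Unified Ideographs (U+4E00..U+9FFF). The extension blocks and anything outside the BMP are ignored. Note that the ideograph block is shared by Chinese and Japanese, so a single ideograph cannot be attributed to one language this way.

```pascal
type
  // Hypothetical classification used only for this sketch.
  TCharKind = (ckAscii, ckHiragana, ckKatakana, ckCjkIdeograph, ckOther);

// Maps one UTF-16 code unit to a coarse Unicode block category.
function ClassifyChar(C: WideChar): TCharKind;
begin
  case Ord(C) of
    $0000..$007F: Result := ckAscii;        // Basic Latin (ASCII)
    $3040..$309F: Result := ckHiragana;     // Japanese only
    $30A0..$30FF: Result := ckKatakana;     // Japanese only
    $4E00..$9FFF: Result := ckCjkIdeograph; // shared by Chinese and Japanese
  else
    Result := ckOther;
  end;
end;
```

A string containing kana is very likely Japanese; a string containing only ideographs could be either language, which is why extra information is needed to decide.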

To obtain the 32-bit UCS-4 code point, you can use http://docwiki.embarcadero.com/Libraries/XE3/en/System.Character.ConvertToUtf32
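The reason a conversion step is needed: a `WideChar` is a UTF-16 code unit, and code points above U+FFFF take two of them (a surrogate pair). The following is a hand-written sketch of the arithmetic for combining a pair, with hypothetical helper names. It is not the library routine linked above, and it does no validation beyond the range tests.

```pascal
// Hypothetical helpers for illustration only.

// True when C is a high (leading) surrogate, U+D800..U+DBFF.
function IsHighSurrogateUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

// True when C is a low (trailing) surrogate, U+DC00..U+DFFF.
function IsLowSurrogateUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Combines a valid surrogate pair into one code point (U+10000..U+10FFFF).
// The caller is assumed to have checked both halves first.
function SurrogatesToCodePoint(HighUnit, LowUnit: WideChar): Cardinal;
begin
  Result := $10000 + ((Cardinal(Ord(HighUnit)) - $D800) shl 10)
                   + (Cardinal(Ord(LowUnit)) - $DC00);
end;
```

For example, the pair `$D842`, `$DFB7` combines to `$20BB7`, which lies in CJK Unified Ideographs Extension B, a block that range tests on a single `WideChar` cannot see.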

There is also the near-standard IBM library, International Components for Unicode (ICU), but no good Delphi translation seems to exist; see "Has anyone used ICU with Delphi?"

You can use the Jedi CodeLib, but its tables are (the comments contradict each other) from either Unicode 4.1 or 5.0, not from the current 6.2, though for Japanese version 5.0 should be enough.

You can also use Microsoft's MLang interface to query Internet-style language codes (RFC 1766).

AFAIK `ord($7f)` is codepage-dependent. It is already depending on the current charset. But +1 for all the linked info. AFAIK ASCII = 7 bit, ANSI = all diverse 8 bit encodings - so your info is correct. — Arnaud Bouchez, Apr 17 '13 at 11:33
@ArnaudBouchez As far as I remember, by the specs both 127 and 255 are control codes rather than symbols. OTOH, who cares about old specs today :-) // That had a bad impact on many FTP servers, which did not take care of special escaping for 254/255 on the control connection, and windows-1251 filenames used them a lot. // And doesn't ANSI (at least in the Win32 sense) stand for MBCS, in contrast with SBCS ASCII? — Arioch 'The, Apr 17 '13 at 11:45
@ArnaudBouchez: `$7F` is not codepage-dependent. It falls within the 7-bit ASCII range, which is the same in all codepages for compatibility purposes. The 8-bit `$80`-`$FF` values are codepage-dependent, though. — Remy Lebeau, Apr 17 '13 at 16:09

score 0 · Answer 3 · answered Apr 17 '13 at 05:28

Generally, a character belongs to ASCII if its code is in the range 0x0000..0x007F; see http://www.unicode.org/charts/PDF/U0000.pdf. Newer Delphi versions have a class function TCharacter.IsAscii, but for some strange reason it is declared as private.

pf1957


score 0 · Answer 4 · answered Apr 17 '13 at 06:00

ASCII characters have a decimal value of 127 or less.

However, unless you are running a teletype machine from the 1960s, ASCII characters may not be sufficient, since they cover only the English alphabet. If you actually need to support Western European characters such as umlauted vowels, grave accents, etc., found in German, French, Spanish, Swedish and so on, then testing for a Unicode character value <= 127 won't suffice. You might get away with testing for a value <= 255 (the Latin-1 range), as long as you don't need to work with Eastern European scripts.
