2

Excuse me if the question is stupid; this has been confusing me. Suppose I have an application (whether C, C++, .NET or Java) on Windows XP that gets data from a remote machine, and the data contains Chinese characters. If the Chinese characters turn into junk, is it correct to say that Windows has nothing to do with the issue, because Windows uses UTF-16 and can handle Chinese characters properly?

On the other hand, suppose Windows used ASCII as its internal encoding. Would that mean that no application on it could ever display Chinese characters correctly?

Thanks in advance.

gfytd
  • 1
    as it says "internal encoding" - that only talks about Windows itself... whatever your app does or does not is entirely up to the app! – Yahia Oct 31 '11 at 05:36

2 Answers

4

The Windows NT kernel uses UNICODE_STRING for many (or is it most?) named objects (e.g. files). The encoding is UTF-16.

Many of the user-mode callable APIs come as pairs of almost identical functions, where one of the pair accepts Unicode strings and the other accepts ANSI strings. The ANSI versions end up converting names from ANSI to Unicode.

For example, when you call C's fopen() function, which accepts 8-bit non-Unicode file names, it ends up invoking CreateFileA() (ANSI), and that eventually calls NtCreateFile(), which accepts Unicode file names. One of NtCreateFile()'s parameters, the OBJECT_ATTRIBUTES structure, contains a pointer to a UNICODE_STRING structure.

If you, on the other hand, call MSVC++'s _wfopen() function, it will reach NtCreateFile() through CreateFileW() (Unicode) without the conversion.
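
To make the conversion step concrete, here is a rough sketch in Python of what the ANSI path has to do before the name reaches the kernel: decode the 8-bit name with a code page and produce 16-bit UTF-16 code units. This only imitates the idea; it is not the actual Win32 code, and the code page (`cp936`), the file name, and the helper `ansi_to_utf16_units` are assumptions made up for illustration.

```python
# Illustration only: imitates the ANSI -> UTF-16 step, not the real Win32 code.
ANSI_CODE_PAGE = "cp936"  # assumption: a Simplified Chinese ANSI code page


def ansi_to_utf16_units(ansi_name, code_page=ANSI_CODE_PAGE):
    """Decode an 8-bit name with a code page and return its UTF-16 code units."""
    text = ansi_name.decode(code_page)
    raw = text.encode("utf-16-le")
    # Split the little-endian byte string into 16-bit units.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# What a narrow (char*) API such as fopen() would receive on such a system.
name_bytes = "中文.txt".encode(ANSI_CODE_PAGE)
units = ansi_to_utf16_units(name_bytes)

print(len(name_bytes))          # 8 bytes: two double-byte characters plus ".txt"
print([hex(u) for u in units])  # six 16-bit units
```

If the bytes were produced under a different code page than the one used for the conversion, the resulting UTF-16 name would be wrong, which is one reason the wide-character functions are the safer choice.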

Alexey Frunze
  • 2
    Is it UTF-16 or UCS-2? UTF-16 includes sequences to represent characters exceeding 0xFFFF; UCS-2 is purely 16-bit. The referenced web page doesn't say. (It also talks about the string being "NULL-terminated", but NULL is a null *pointer* constant.) – Keith Thompson Oct 31 '11 at 08:02
  • @Keith Thompson: The Rtl* string functions compare bytes, not code points, but the kernel really doesn't care about encodings anyway. Also, what the documentation is trying to say is that UNICODE_STRINGs aren't necessarily null-terminated (with null being (WCHAR)0), and that the Length field never includes any null-terminator, if present. – wj32 Oct 31 '11 at 08:50
  • @wj32: The point is that "NULL-terminated" is incorrect; "null-terminated" would be correct. (I've e-mailed them; we'll see if it does any good.) – Keith Thompson Oct 31 '11 at 08:59
  • @Keith Thompson: There's a constant called UNICODE_NULL, so they might be referring to that. – wj32 Oct 31 '11 at 09:01
  • 3
    @KeithThompson: I don't think there's much need and full support for Unicode everywhere in Windows, for example, [NTFS doesn't try to make much sense of file names nor validate them](http://en.wikipedia.org/wiki/NTFS#Internals): `NTFS allows any sequence of 16-bit values for name encoding (file names, stream names, index names, etc.). This means UTF-16 codepoints are supported, but the file system does not check whether a sequence is valid UTF-16 (it allows any sequence of short values, not restricted to those in the Unicode standard).` – Alexey Frunze Oct 31 '11 at 09:13
0

To store any text in memory and display it on screen, the OS needs to handle that text in some encoding behind the scenes. Which encoding that is shouldn't matter to you. It could handle it as HTML-encoded ASCII for all you know, as long as the APIs accept your text and output the right thing.

"Windows uses UTF-16 internally" means Windows happens to store and handle text internally as UTF-16. It also supports Chinese text. These two things aren't necessarily connected. Yes, using UTF-16 internally makes it easier to support Chinese, which is probably why the Windows engineers chose to go with UTF-16.
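
As a small illustration of why junk characters usually come from the application rather than from the OS's internal encoding, the Python sketch below decodes the same received bytes twice: once with the wrong encoding and once with the right one. It assumes the remote machine sends UTF-8 and that the application wrongly guesses `cp1252`; both encodings are hypothetical choices for the example.

```python
# Bytes as sent by the hypothetical remote machine (assumed to be UTF-8).
received = "中文".encode("utf-8")

# The application guesses the wrong encoding: every byte becomes its own character.
wrong = received.decode("cp1252", errors="replace")

# The application uses the encoding the sender actually used.
right = received.decode("utf-8")

print(repr(wrong))  # six unrelated Western characters (mojibake)
print(right)        # the original two Chinese characters
```

The OS stores both results equally well; only the second one is the text the sender meant.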

deceze
  • 1
    OK, thanks. So it all depends on the app and has nothing to do with the OS, right? What about the regional and language options (found in the Windows Control Panel)? Could that setting cause junk-character problems? – gfytd Oct 31 '11 at 06:38
  • 1
    Depends on your app. There's a setting that controls how legacy apps that are *not* using Unicode should behave. If your app handles encodings properly itself, none of these settings should make a difference. I can't say much more since I'm really not a Windows programmer. – deceze Oct 31 '11 at 06:55
  • Why doesn't Microsoft use UTF-32, so it supports all characters? UTF-16 can't hold all the characters Unicode 13.0.0 has, for example, right? (I've been reading about this and I hope I'm not asking a stupid question haha) From what I got from this, to support all the characters, Windows would need UTF-32. Or not? Or can it use UTF-16 or UTF-8 and use something like \u4356+\u2349 to represent a 32-bit character? – Edw590 Sep 10 '20 at 09:49
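
On the UTF-16 versus UCS-2 and UTF-32 questions raised in the comments: UTF-16 can represent every Unicode code point, because code points above 0xFFFF are encoded as a pair of 16-bit units (a surrogate pair). A minimal Python sketch, using the hypothetical helper `utf16_units` and the sample characters U+4E2D and U+20000:

```python
def utf16_units(ch):
    """Return the UTF-16 code units of a string as a list of integers."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp = utf16_units("\u4e2d")                # U+4E2D fits in a single 16-bit unit
supplementary = utf16_units("\U00020000")  # U+20000 needs a surrogate pair

# The same pair computed by hand from the code point.
offset = 0x20000 - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print([hex(u) for u in bmp])            # ['0x4e2d']
print([hex(u) for u in supplementary])  # ['0xd840', '0xdc00']
print(hex(high), hex(low))              # 0xd840 0xdc00
```

UCS-2 has no such pairs, which is why it stops at 0xFFFF.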