Convert ASCII string to Unicode? Windows, pure C

Question

I've found answers to this question for many programming languages, except for C, using the Windows API. No C++ answers please. Consider the following:

#include <windows.h>
char *string = "The quick brown fox jumps over the lazy dog";
WCHAR unistring[strlen(string)+1];

What function can I use to fill unistring with the characters from string?

Please specify what *encoding* you mean. "Unicode" is not an encoding; it does not tell you how to represent characters as bits in memory. — unwind, Jul 20 '12 at 09:51
@DevSolar ok, in windows context unicode has usually meant [UTF16-LE](http://stackoverflow.com/a/3951826/995876) so I guessed wrong :P — Esailija, Jul 20 '12 at 09:55
@DevSolar Windows switched from UCS-2 to UTF16-LE many many years ago. I think by the time Windows 2000 came out the transition was complete. — Mark Ransom, Aug 25 '20 at 16:21
@MarkRansom See my comment under Rup's answer. That document was updated in 2018, and *still* states that support for supplementary characters - i.e. UTF-16 - is not universal. — DevSolar, Aug 26 '20 at 05:05
@DevSolar there's a note on the page, but it only applies to Windows 2000. Either the page is hopelessly out of date, or there are Windows bugs that they haven't deemed important enough to fix. — Mark Ransom, Aug 26 '20 at 15:09

score 12 · Accepted Answer · answered Jul 20 '12 at 09:48

MultiByteToWideChar:

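#include <windows.h>
#include <string.h>  /* strlen */
char *string = "The quick brown fox jumps over the lazy dog";
size_t len = strlen(string);
WCHAR unistring[len + 1];  /* variable-length array: C99 feature, not supported by MSVC */
/* CP_OEMCP selects the system OEM code page; CP_ACP would select the ANSI one. */
/* -1 converts up to and including the terminating NUL; the last argument is the buffer
   size in WCHARs, not bytes. Returns the number of WCHARs written, or 0 on failure. */
int result = MultiByteToWideChar(CP_OEMCP, 0, string, -1, unistring, (int)(len + 1));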

answered Jul 20 '12 at 09:48

Rup

The linked doc for that function says "Maps a character string to a UTF-16 (wide character) string." Note that: 1) Conversion is done to UCS-2, not UTF-16. 2) UCS-2 is not Unicode, and can encode only characters from the BMP. 3) UTF-16 is not "wide", but multibyte. I really wish Microsoft would get their act together and stop spreading disinformation on this subject. – DevSolar Jul 20 '12 at 10:12
Correction: UCS-2 can encode more than just the BMP, but in doing so you are leaving the encoding range where UCS-2 and UTF-16 are mostly compatible. – DevSolar Jul 20 '12 at 11:28
I haven't bothered testing the differences myself, but [this Microsoft blog](http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx) says that since XP it really has been UTF-16 not UCS-2. – Rup Jul 20 '12 at 13:16
Actually it says UTF-16 "became more fully supported", whatever that might mean. [This](http://msdn.microsoft.com/en-us/library/dd374069.aspx) is much more enlightening, though it states that in Win2K "not all system components are compatible with supplementary characters" - and as far as I could see, it's the latest installment of that document, leaving anyone guessing at what might still be lurking in the depths of the API. The fact remains that having a 16-bit WCHAR is plain and simply *wrong*, because it's multibyte, not wide. I still recommend ICU over any native C API. – DevSolar Jul 20 '12 at 13:48

DevSolar · Answer 2 · 2012-07-20T11:00:42.500

If you are really serious about Unicode, you should refer to International Components for Unicode, which is a cross-platform solution for handling Unicode conversions and storage in either C or C++.

Your WCHAR, for example, is not Unicode to begin with, because Microsoft somewhat prematurely defined wchar_t to be 16 bits (UCS-2) and got stuck in backward-compatibility hell when Unicode outgrew 16 bits. UCS-2 is almost, but not quite, identical to UTF-16, the latter being in fact a multibyte encoding just like UTF-8. "Wide" format in Unicode means 32 bits (UTF-32), and even then you don't have a 1:1 relationship between code points (i.e. 32-bit values) and abstract characters (i.e. printable glyphs).
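To make the "UTF-16 is multibyte" point concrete, here is a minimal sketch of how a single code point maps to one or two 16-bit code units. The helper name `utf16_encode` is made up for illustration and is not part of any Windows or ICU API; it only follows the surrogate-pair arithmetic from the UTF-16 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): encodes one Unicode code
   point as UTF-16. Returns the number of 16-bit units written (1 or 2),
   or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one unit, same value as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Anything in the BMP takes one unit, so a 16-bit WCHAR holds it directly; anything above U+FFFF needs two, which is why a WCHAR count is not a character count.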

Gratuitous, loosely related list of links:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The UTF-8 Everywhere Manifesto
Commonly confused characters by Greg Baker

score 2 · Answer 3 · edited Jul 20 '12 at 13:21

You should look into the MultiByteToWideChar function.
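As a rough sketch of what that could look like without a variable-length array: call MultiByteToWideChar once with a zero-sized output buffer to learn the required size, allocate, then call it again. The helper name `widen_with_win32` and the choice of CP_ACP are assumptions made for this example, not something stated in the answer, and the whole block is guarded so it compiles to nothing outside Windows.

```c
#ifdef _WIN32   /* Win32-only sketch; compiles to nothing on other platforms */
#include <windows.h>
#include <stdlib.h>   /* malloc, free */

/* Hypothetical helper: returns a heap-allocated wide copy of src,
   or NULL on failure. The caller releases it with free(). */
WCHAR *widen_with_win32(const char *src)
{
    WCHAR *dst;
    /* First call: cchWideChar == 0 asks only for the required size in
       WCHARs, terminating NUL included because cbMultiByte is -1. */
    int needed = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (needed <= 0)
        return NULL;

    dst = malloc((size_t)needed * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second call: perform the conversion into the allocated buffer. */
    if (MultiByteToWideChar(CP_ACP, 0, src, -1, dst, needed) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
}
#endif /* _WIN32 */
```

Sizing the buffer from the first call also stays correct for multibyte code pages, where the number of WCHARs is not the same as the number of input bytes.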

edited Jul 20 '12 at 13:21

dda

answered Jul 20 '12 at 09:48

pive_

score 1 · Answer 4 · answered Aug 25 '20 at 16:13

If you KNOW that the input is pure ASCII and there are no extended character sets involved, there's no need to call any fancy conversion function. All the character codes in ASCII are the same in Unicode, so all you need to do is copy from one array to the other.

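#include <windows.h>
#include <string.h>  /* strlen */
char *string = "The quick brown fox jumps over the lazy dog";
size_t len = strlen(string);
WCHAR unistring[len+1];
size_t i;
/* i <= len so that the terminating NUL is copied as well */
for (i = 0; i <= len; ++i)
    unistring[i] = (unsigned char)string[i];  /* cast avoids sign extension if a non-ASCII byte slips in */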
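Since the loop is only correct when the input really is pure ASCII, one way to enforce that assumption is to check each byte while copying. This is a sketch, not part of the answer: `widen_ascii` is a hypothetical helper, and it uses plain `wchar_t` (which `WCHAR` is a typedef for on Windows) so that it does not depend on `<windows.h>`.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copies src into dst (capacity dst_len wide chars),
   widening each byte. Returns 1 on success, 0 if src contains a non-ASCII
   byte or does not fit together with its terminating NUL. */
int widen_ascii(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t i;
    for (i = 0; i < dst_len; ++i) {
        unsigned char c = (unsigned char)src[i];
        if (c > 0x7F)
            return 0;          /* not ASCII: needs a real conversion function */
        dst[i] = (wchar_t)c;   /* ASCII codes equal their Unicode code points */
        if (c == '\0')
            return 1;          /* terminator copied, done */
    }
    return 0;                  /* ran out of room before the terminator */
}
```

When the helper returns 0, the contents of dst are unspecified and the caller would fall back to a code-page-aware conversion.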

score 0 · Answer 5 · answered Jul 20 '12 at 09:48

You can use mbstowcs to convert from "multibyte" to wide character strings.
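A minimal sketch of the mbstowcs route, assuming the input is valid in the current C locale (plain ASCII is valid in the default "C" locale). The helper name `widen_with_mbstowcs` is invented for this example. The buffer is sized from strlen because every wide character consumes at least one input byte.

```c
#include <stdlib.h>   /* mbstowcs, malloc, free */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t */

/* Hypothetical helper: returns a heap-allocated wide copy of src, or NULL
   on allocation failure or invalid input. The caller releases it with free(). */
wchar_t *widen_with_mbstowcs(const char *src)
{
    size_t cap = strlen(src) + 1;   /* upper bound on wide chars, NUL included */
    wchar_t *dst = malloc(cap * sizeof *dst);
    size_t n;

    if (dst == NULL)
        return NULL;

    n = mbstowcs(dst, src, cap);    /* interprets src in the current LC_CTYPE locale */
    if (n == (size_t)-1) {          /* invalid multibyte sequence */
        free(dst);
        return NULL;
    }
    return dst;                     /* n < cap, so the terminator was written */
}
```

Unlike MultiByteToWideChar, there is no code page argument here: the interpretation of the bytes depends on the locale set with setlocale, so non-ASCII input behaves differently from one configuration to the next.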

answered Jul 20 '12 at 09:48

Some programmer dude

That is incorrect @Joachim, the `[N]` will allocate `N` `WCHAR`. It would have been correct if it was a `char` array. – hmjd Jul 20 '12 at 09:49
Huh? Of course `WCHAR unistring[n]` reserves n `WCHAR`s, so no need to scale. Otherwise `int x[4]` would just reserve one integer on a 4-byte integer system? – unwind Jul 20 '12 at 09:50
@hmjd Ah damn, I was thinking too quick again! – Some programmer dude Jul 20 '12 at 09:51
@unwind Yeah, removed faulty stuff. – Some programmer dude Jul 20 '12 at 09:53

Dmytro · Answer 6 · 2018-03-30T16:52:54.290

This is another way to do it. It's not as direct, but it does the job when you don't feel like typing in six arguments in a very specific order and remembering the code page numbers/macros for MultiByteToWideChar. It takes 16 microseconds to perform on this laptop, most of it (9 microseconds) spent in AddAtomW.

For reference, MultiByteToWideChar takes between 0 and 1 microseconds.

#include <Windows.h>

const wchar_t msg[] = L"We did it!";

int main(int argc, char **argv)
{
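    /* one char per wchar_t of msg (terminator included), plus one spare */
    char result[(sizeof(msg) / sizeof(msg[0])) + 1];
    ATOM tmp;

    tmp = AddAtomW(msg);                        /* store the wide string in the local atom table */
    GetAtomNameA(tmp, result, sizeof(result));  /* read it back through the ANSI entry point */
    MessageBoxA(NULL, result, "it says", MB_OK | MB_ICONINFORMATION);
    DeleteAtom(tmp);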

    return 0;
}
