5

I've found answers to this question for many programming languages, except for C, using the Windows API. No C++ answers please. Consider the following:

#include <windows.h>
char *string = "The quick brown fox jumps over the lazy dog";
WCHAR unistring[strlen(string)+1];

What function can I use to fill unistring with the characters from string?

dda
user1540336
  • Does unicode mean UTF16-LE here? – Esailija Jul 20 '12 at 09:49
  • Please specify what *encoding* you mean, "Unicode" is not an encoding, it does not tell you how to represent characters as bits in memory. – unwind Jul 20 '12 at 09:51
  • @Esailija: That'd be UCS-2, not UTF16-LE... – DevSolar Jul 20 '12 at 09:52
  • @DevSolar ok, in windows context unicode has usually meant [UTF16-LE](http://stackoverflow.com/a/3951826/995876) so I guessed wrong :P – Esailija Jul 20 '12 at 09:55
  • @DevSolar Windows switched from UCS-2 to UTF16-LE many many years ago. I think by the time Windows 2000 came out the transition was complete. – Mark Ransom Aug 25 '20 at 16:21
  • @MarkRansom See my comment under Rup's answer. That document was updated in 2018, and *still* states that support for supplementary characters - i.e. UTF-16 - is not universal. – DevSolar Aug 26 '20 at 05:05
  • @DevSolar there's a note on the page, but it only applies to Windows 2000. Either the page is hopelessly out of date, or there are Windows bugs that they haven't deemed important enough to fix. – Mark Ransom Aug 26 '20 at 15:09

6 Answers

12

MultiByteToWideChar:

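#include <windows.h>
#include <string.h>  /* strlen */
char *string = "The quick brown fox jumps over the lazy dog";
size_t len = strlen(string);
WCHAR unistring[len + 1];
/* CP_OEMCP treats the input as OEM code page text; use CP_ACP for ANSI text.
   Passing -1 converts through the terminating NUL; the result is 0 on failure. */
int result = MultiByteToWideChar(CP_OEMCP, 0, string, -1, unistring, (int)(len + 1));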
Rup
  • The linked doc for that function says "Maps a character string to a UTF-16 (wide character) string." Note that: 1) Conversion is done to UCS-2, not UTF-16. 2) UCS-2 is not Unicode, and can encode only characters from the BMP. 3) UTF-16 is not "wide", but multibyte. I really wish Microsoft would get their act together and stop spreading disinformation on this subject. – DevSolar Jul 20 '12 at 10:12
  • Correction: UCS-2 can encode more than just the BMP, but in doing so you are leaving the encoding range where UCS-2 and UTF-16 are mostly compatible. – DevSolar Jul 20 '12 at 11:28
  • I haven't bothered testing the differences myself, but [this Microsoft blog](http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx) says that since XP it really has been UTF-16 not UCS-2. – Rup Jul 20 '12 at 13:16
  • Actually it says UTF-16 "became more fully supported", whatever that might mean. [This](http://msdn.microsoft.com/en-us/library/dd374069.aspx) is much more enlightening, though it states that in Win2K "not all system components are compatible with supplementary characters" - and as far as I could see, it's the latest installment of that document, leaving anyone guessing at what might still be lurking in the depths of the API. The fact remains that having a 16-bit WCHAR is plain and simply *wrong*, because it's multibyte, not wide. I still recommend ICU over any native C API. – DevSolar Jul 20 '12 at 13:48
3

If you are really serious about Unicode, you should refer to International Components for Unicode, which is a cross-platform solution for handling Unicode conversions and storage in either C or C++.

Your WCHAR, for example, is not Unicode to begin with, because Microsoft somewhat prematurely defined wchar_t to be 16-bit (UCS-2) and got stuck in backward-compatibility hell when Unicode grew beyond 16 bits. UCS-2 is almost, but not quite, identical to UTF-16; the latter is in fact a multibyte encoding just like UTF-8. "Wide" format in Unicode means 32-bit (UTF-32), and even then you don't have a 1:1 relationship between code points (i.e. 32-bit values) and abstract characters (i.e. printable glyphs).
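To make the "UTF-16 is multibyte" point concrete, here is a minimal sketch of how a single code point maps to one or two 16-bit code units. The helper name `utf16_encode` is made up for illustration and is not part of the Windows API or ICU.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
   Writes one or two 16-bit code units to out and returns how many
   were written, or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* surrogate range is reserved */
    if (cp > 0x10FFFF)
        return 0;                               /* beyond the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

Anything outside the BMP needs two code units, so a buffer of N 16-bit elements does not necessarily hold N characters.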

Gratuitous, loosely related list of links:

DevSolar
2

You should look into the MultiByteToWideChar function.

dda
pive_
1

If you KNOW that the input is pure ASCII and there are no extended character sets involved, there's no need to call any fancy conversion function. All the character codes in ASCII are the same in Unicode, so all you need to do is copy from one array to the other.

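#include <windows.h>
#include <string.h>  /* strlen */
char *string = "The quick brown fox jumps over the lazy dog";
size_t len = strlen(string);
WCHAR unistring[len+1];
size_t i;
/* <= len so that the terminating NUL is copied as well */
for (i = 0; i <= len; ++i)
    unistring[i] = (unsigned char)string[i];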
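If you would rather check the "pure ASCII" assumption than trust it, something along these lines might do. It is only a sketch: `ascii_to_wide` is a made-up name, and it uses standard `wchar_t` rather than `WCHAR` so it does not depend on windows.h.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copies src into dst only if every byte is 7-bit ASCII.
   dst must have room for dst_count wide characters including the terminator.
   Returns 1 on success, 0 if a non-ASCII byte is found or dst is too small
   (the contents of dst are unspecified in that case). */
int ascii_to_wide(const char *src, wchar_t *dst, size_t dst_count)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    for (i = 0; src[i] != '\0'; ++i) {
        if (i + 1 >= dst_count)
            return 0;                   /* no room for this char plus the NUL */
        if ((unsigned char)src[i] > 0x7F)
            return 0;                   /* not pure ASCII */
        dst[i] = (wchar_t)src[i];
    }
    if (i >= dst_count)
        return 0;                       /* covers dst_count == 0 */
    dst[i] = L'\0';
    return 1;
}
```

When it returns 0 you know the input needs a real conversion function instead of a plain copy.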
Mark Ransom
0

You can use mbstowcs to convert from "multibyte" to wide character strings.
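A minimal sketch of how that might look, wrapped in a helper I made up for illustration (`to_wide`). Keep in mind that mbstowcs interprets the input according to the current C locale (see setlocale with LC_CTYPE), so anything beyond plain ASCII depends on how the locale is set up.

```c
#include <assert.h>
#include <stdlib.h>   /* mbstowcs */
#include <wchar.h>    /* wchar_t */

/* Hypothetical helper: converts a NUL-terminated multibyte string into dst,
   which has room for dst_count wide characters including the terminator.
   Returns the number of wide characters written (excluding the terminator),
   or (size_t)-1 if the input is invalid in the current locale or does not fit. */
size_t to_wide(const char *src, wchar_t *dst, size_t dst_count)
{
    size_t n;

    assert(src != NULL && dst != NULL && dst_count > 0);

    n = mbstowcs(dst, src, dst_count);
    if (n == (size_t)-1)
        return (size_t)-1;      /* invalid multibyte sequence */
    if (n == dst_count) {
        dst[0] = L'\0';         /* buffer filled with no room for the NUL */
        return (size_t)-1;
    }
    return n;
}
```

The check against dst_count matters because mbstowcs does not write a terminator when the buffer fills up exactly.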

Some programmer dude
0

This is another way to do it. It's not as direct, but it does the job when you don't feel like typing six arguments in a very specific order and remembering the code page numbers/macros for MultiByteToWideChar. It takes 16 microseconds on this laptop, most of it (9 microseconds) spent in AddAtomW.

For reference, MultiByteToWideChar takes between 0 and 1 microseconds.

#include <Windows.h>

const wchar_t msg[] = L"We did it!";

int main(int argc, char **argv)
{
    char result[(sizeof(msg) / sizeof(msg[0])) + 1];
    ATOM tmp;

    tmp = AddAtomW(msg);
    GetAtomNameA(tmp, result, sizeof(result));
    MessageBoxA(NULL, result, "it says", MB_OK | MB_ICONINFORMATION);
    DeleteAtom(tmp);

    return 0;
}
Dmytro