
I'm embedding a text file in my project by adding it to the resources and then loading it at run time.

I use LockResource and a static_cast to get a wchar_t* that I then construct a std::wstring from:
std::wstring sData(static_cast<wchar_t*>(pData));
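
For context, the loading code looks roughly like this (IDR_MYTEXT and RT_RCDATA are placeholders for whatever ID and type my .rc file actually uses; error checking omitted):

    #include <windows.h>
    #include <string>

    // Locate and lock the embedded text resource.
    HRSRC   hRes  = FindResource(nullptr, MAKEINTRESOURCE(IDR_MYTEXT), RT_RCDATA);
    HGLOBAL hGlob = LoadResource(nullptr, hRes);
    void*   pData = LockResource(hGlob);

    // Treat the raw resource bytes as a null-terminated wide string.
    std::wstring sData(static_cast<wchar_t*>(pData));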

My project uses UNICODE (windows), which is why I'm using std::wstring and wchar_t.

I found out that I have to save the file with UCS-2 LE encoding, otherwise it just reads as gibberish. I'm guessing this is because that's the encoding Windows uses.

My question is, is it safe to assume all Windows operating systems currently use UCS-2 LE? I don't want to run into a system using UCS-2 BE (or something else). My program would crash horribly.

I could save the file in ANSI and then convert it to whatever encoding the operating system uses with MultiByteToWideChar, but this would be a waste of time if it's definitely going to be UCS-2 LE.

Josh

2 Answers


All recent and current versions of Windows (excluding the Xbox) use UTF-16 LE.

Note that there's a bug with how you're initializing the string variable:

std::wstring sData(static_cast<wchar_t*>(pData));

This assumes that the resource ends with a terminating (two-byte) 0, which I don't think is guaranteed if you're just referencing the file in your resources. You should get the size of the resource, and use the two-pointer constructor for sData.
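
For example, something along these lines, assuming pData comes from LockResource and hRes is the HRSRC returned by FindResource (names as in the question; error checks omitted):

    // Get the resource's size in bytes and construct the string from an
    // explicit range instead of relying on a terminating null.
    DWORD cbData = SizeofResource(nullptr, hRes);
    const wchar_t* pText = static_cast<const wchar_t*>(pData);

    std::wstring sData(pText, pText + cbData / sizeof(wchar_t));

(If the file was saved with a BOM, the first character will be U+FEFF, which you probably want to skip.)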

If you're worried about time (as suggested by your comment on using MultiByteToWideChar), you should be aware that you're copying the data from the resource into dynamic memory, and this copy is probably almost as slow as doing a conversion. If you're doing this just once, I wouldn't worry about the speed. I'd save the text as UTF-8, and use MultiByteToWideChar, especially if the UTF-8 encoding is more efficient for your text, as that would make your binary smaller.
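
A rough sketch of that conversion, with pData and cbData obtained from LockResource/SizeofResource as above (error handling omitted):

    // One-time conversion of the UTF-8 resource bytes into a std::wstring.
    const char* pUtf8 = static_cast<const char*>(pData);

    // First call computes the required length in wide characters,
    // second call performs the actual conversion.
    int cchWide = MultiByteToWideChar(CP_UTF8, 0, pUtf8, static_cast<int>(cbData), nullptr, 0);
    std::wstring sData(cchWide, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, pUtf8, static_cast<int>(cbData), &sData[0], cchWide);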

If speed is an issue (and if you don't need to modify the string at run time), then I wouldn't use a std::wstring at all. I'd make a class that provides a similar interface, but have it point directly to the resource memory rather than copy the entire text into dynamic memory. That saves load time and memory.
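
A bare-bones sketch of that idea (the class name is made up; with C++17 you could simply use std::wstring_view over the locked resource bytes):

    #include <cstddef>

    // Non-owning view over the locked resource memory; nothing is copied.
    class ResourceText
    {
    public:
        ResourceText(const void* p, std::size_t cb)
            : m_p(static_cast<const wchar_t*>(p)), m_len(cb / sizeof(wchar_t)) {}

        const wchar_t* data()  const { return m_p; }
        std::size_t    size()  const { return m_len; }
        const wchar_t* begin() const { return m_p; }
        const wchar_t* end()   const { return m_p + m_len; }

    private:
        const wchar_t* m_p;
        std::size_t    m_len;
    };

As long as the module stays loaded, the resource memory stays valid, so the view never dangles.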

Adrian McCarthy
  • I believe Windows actually uses UTF-16, which is not the same thing as UCS-2. Also, it's debatable whether or not the OS on the Xbox counts as Windows, but the Xbox is big endian. – bames53 Aug 29 '12 at 16:50

All versions of Windows are LE, and I don't think Microsoft has any plan to change its OS to BE. Windows NT 5 (Win2K) and later are all based on UTF-16, so yes, it is always safe to assume Windows is UCS-2 LE.

BigBoss
  • UCS-2 and UTF-16 are not the same thing. They only agree with each other for Unicode codepoints between U+0000 through U+FFFF (the BMP plane). UCS-2 cannot handle the thousands of available Unicode codepoints between U+10000 through U+10FFFF. – Remy Lebeau Aug 31 '12 at 18:22
  • @Remy Of course you are right, but for the LE/BE question asked here I think they can be treated as the same thing, and as you know `UCS-2` is historical and has been superseded by `UTF-16` precisely because of the code points you mention. – BigBoss Sep 01 '12 at 01:58
  • LE/BE has nothing to do with the difference between UCS-2 and UTF-16. UCS-2 and UTF-16 both have LE/BE variants. Whatever fits in UCS-2 will fit in UTF-16, but the reverse is not true. You have to make this distinction or else you risk truncating your data. – Remy Lebeau Sep 01 '12 at 17:36
  • You are right, but today there is nothing called UCS-2 since it has been superseded by UTF-16. My point is that if you use UCS-2 you can't address code points outside the BMP, so if the asker wants the data to be representable as UCS-2 it won't use those code points anyway; and I think the problem here is the size of characters and their endianness, not code points outside the BMP. But anyway you are right, thanks for the comment :) – BigBoss Sep 01 '12 at 20:58
  • @RemyLebeau: One could argue that they only agree for code points U+0000 through U+D7FF and U+E000 through U+FFFF. – Mooing Duck Apr 24 '14 at 18:46
  • @MooingDuck: technically true, although AFAIK UCS-2 did not define any meaning for codepoints U+D800 through U+DFFF, which is probably why Unicode chose to reserve them for UTF-16 without losing compatibility. – Remy Lebeau Apr 24 '14 at 19:02