
Let's say we have a file main.cpp in windows-1251 encoding with the following content:

int main()
{
     wchar_t* ws = L"котэ"; //cat in russian
     return 0;
}

Everything is fine if we compile this in Visual Studio, BUT we are going to compile it with GCC, whose default encoding for source code is UTF-8. Of course we could convert the file encoding or pass the "-finput-charset=windows-1251" option to the compiler, but what if we can't? One workaround is to replace the raw text with hex UTF-32 bytes:

int main()
{
     wchar_t* ws = (wchar_t*)"\x3A\x04\x00\x00\x3E\x04\x00\x00\x42\x04\x00\x00\x4D\x04\x00\x00\x00\x00\x00\x00"; //cat in russian
     return 0;
}

But it's kind of ugly: 4 letters become 20 bytes.

How else can it be done?

  • Use `wchar_t* ws = L"котэ";` – πάντα ῥεῖ Nov 01 '16 at 12:31
  • @πάνταῥεῖ, incorrect. That will work only if main.cpp is UTF-8 encoded –  Nov 01 '16 at 12:32
  • @AlekDepler [`U"котэ"`](http://en.cppreference.com/w/cpp/language/string_literal) then. – πάντα ῥεῖ Nov 01 '16 at 12:35
  • @πάνταῥεῖ, there is no C++11 here and I'm pretty sure the problem remains –  Nov 01 '16 at 12:36
  • @SamVarshavchik, "cat" in Ukrainian is "кiт", go finish school, please. –  Nov 01 '16 at 12:37
  • Instead of putting strings into the source, put them into another file instead. Then the proper conversions, if any, can be done at runtime. Isn't that standard procedure for i18n? – Mark Ransom Nov 01 '16 at 16:03
  • @MarkRansom yes, if project is new. What about old platform-dependent code? Massive refactoring? –  Nov 01 '16 at 19:44
  • @AlekDepler of course context matters, and you didn't provide any in the question. I just wanted to throw that possibility out there, and it wasn't complete enough to warrant an answer. – Mark Ransom Nov 01 '16 at 20:00
  • Your question is confusing. Do you mean *"GCC in Windows"* or *"GCC in Linux or another non-Windows operating system"*? `wchar_t` in Windows is 2 bytes, in POSIX is 4 bytes. In Linux you need to convert to UTF8, or use `const char *buf = u8"котэ";` – Barmak Shemirani Nov 02 '16 at 02:37
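A minimal sketch of the point in that last comment, assuming a C++11 compiler (which the asker says is not available) and assuming the compiler decodes the source file correctly: sizeof(wchar_t) differs between platforms, while a u8 literal always yields UTF-8 bytes in the binary.

#include <cstdio>

int main()
{
    // 2 on Windows (UTF-16 wchar_t), typically 4 on Linux (UTF-32 wchar_t)
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

    // A u8 literal is encoded as UTF-8 in the compiled program,
    // independent of the platform's wchar_t size (C++11 and later).
    const char* cat = u8"котэ"; // "cat" in Russian
    std::printf("%s\n", cat);
    return 0;
}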

1 Answer


What you need is to use a file encoding that is understood by both GCC and VS. It seems to me that saving the file in UTF-8 encoding is the way forward.

Also see: How can I make Visual Studio save all files as UTF-8 without signature on Project or Solution level?
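For illustration, a minimal sketch of that route, assuming main.cpp really is stored as UTF-8; the build commands in the comments are indicative only (GCC assumes UTF-8 input by default, while MSVC needs either a BOM or its /utf-8 switch to treat a BOM-less file as UTF-8):

// main.cpp, saved as UTF-8
//   GCC:  g++ main.cpp
//   MSVC: cl /utf-8 main.cpp   (or save the file with a BOM)
#include <cstdio>
#include <cwchar>

int main()
{
    const wchar_t* ws = L"котэ"; // "cat" in Russian
    // All four characters are in the BMP, so this prints 4 with both
    // 2-byte (Windows) and 4-byte (Linux) wchar_t.
    std::printf("%zu\n", std::wcslen(ws));
    return 0;
}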

m-bitsnbites
  • I know that, and at first glance this looks like a simple and obvious problem. But the thing is: if you define a narrow non-English string (char* s = "котэ";), save the file as UTF-8 and compile it in Visual Studio... guess what? You'll get raw UTF-8 bytes in that string instead of system-locale-encoded ones, which leads to various problems throughout the code (for example, strlen will not be able to calculate the length correctly). –  Nov 01 '16 at 19:38
  • @AlekDepler you should be staying consistent with your usage of `char` and `wchar_t`, or you will definitely run into trouble like that. If you *must* mix them, it's probably best if you keep the char strings to ASCII only. – Mark Ransom Nov 01 '16 at 20:01
  • Yes, there are many different APIs that you need to consider. In VS/Windows the convention is to use wchar for anything Unicode (e.g. filenames), while in Linux it is more common to use char and interpret it as UTF-8. strlen and friends, as you say, count bytes, not Unicode characters - which is fine for most purposes (e.g. to determine how much memory to allocate or copy). If you want to write portable code, you need to be careful with how you use your Unicode strings. – m-bitsnbites Nov 02 '16 at 07:07
  • One approach is to go all-in on UTF-8: encode your files as UTF-8, and use UTF-8 strings all over (use char* strings and/or std::string, and always interpret them as UTF-8). Then you can use a library such as [UTF8-CPP](https://github.com/nemtrif/utfcpp) to convert to/from UTF-16 and get the true string length (number of Unicode code points of a string), etc. – m-bitsnbites Nov 02 '16 at 08:07
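To make the strlen point concrete, here is a rough sketch that counts UTF-8 code points by hand (a library such as UTF8-CPP provides helpers for this as well); the escaped bytes below are simply the UTF-8 encoding of "котэ":

#include <cstdio>
#include <cstring>

// Count UTF-8 code points by skipping continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_length(const char* s)
{
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    const char* cat = "\xD0\xBA\xD0\xBE\xD1\x82\xD1\x8D"; // "котэ" as raw UTF-8 bytes
    std::printf("%zu bytes\n", std::strlen(cat));       // 8
    std::printf("%zu code points\n", utf8_length(cat)); // 4
    return 0;
}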