
I am writing a C++11 library that provides a cross-platform API for setting an environment variable. The benefit of C++11 is that all char strings are UTF-8:

environment::Set(const std::string& name, const std::string& value)

On Windows there is the SetEnvironmentVariable function that has two aliases SetEnvironmentVariableA and SetEnvironmentVariableW.

My understanding is that the wide version takes a 16-bit wchar_t, which in Windows land is UTF-16, and the ANSI version is ASCII.

Is the correct way to use this function to convert the std::string into UTF-16 (with std::codecvt_utf8_utf16 or something) and then pass it into the wide version of the function?

Matt Clarkson
  • By default (most build systems have [`UNICODE`](http://msdn.microsoft.com/en-us/library/ff381407.aspx) defined), the W variant is chosen when you just call _SetEnvironmentVariable_ and hence calling _SetEnvironmentVariableW_ isn't required. – legends2k Sep 25 '13 at 15:04
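
For illustration, the unsuffixed name is dispatched by a macro in the SDK headers roughly along these lines (a simplified sketch, not the verbatim header):

#ifdef UNICODE
#define SetEnvironmentVariable  SetEnvironmentVariableW
#else
#define SetEnvironmentVariable  SetEnvironmentVariableA
#endif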

2 Answers


Yes, Windows supports Unicode only through the "wide" versions of its APIs (that use UTF-16); the "ANSI" (char-based) functions only support "local" codepages, not UTF-8.
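
For reference, the UTF-8 to UTF-16 conversion can also be done with the Win32 MultiByteToWideChar function instead of &lt;codecvt&gt;; a minimal sketch follows (the Utf8ToUtf16 helper name is just for illustration and not part of this answer):

#include <Windows.h>
#include <string>

// Minimal sketch: convert a UTF-8 std::string to a UTF-16 std::wstring with
// the Win32 MultiByteToWideChar API (error handling omitted for brevity).
std::wstring Utf8ToUtf16(const std::string& utf8) {
  if (utf8.empty()) return std::wstring();
  // First call queries the required length in wchar_t units.
  int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                static_cast<int>(utf8.size()), nullptr, 0);
  std::wstring utf16(len, L'\0');
  // Second call performs the actual conversion into the buffer.
  MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                      static_cast<int>(utf8.size()), &utf16[0], len);
  return utf16;
}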

Matteo Italia
  • FYI Visual Studio 2013 is deprecating non-Unicode builds: http://blogs.msdn.com/b/vcblog/archive/2013/07/08/mfc-support-for-mbcs-deprecated-in-visual-studio-2013.aspx – the_mandrill Sep 25 '13 at 15:02
  • @the_mandrill: that was about time, especially since the only sensible targets for MBCS builds (Windows 9x & co.) aren't supported by the CRT anyway since several VC++ versions. – Matteo Italia Sep 25 '13 at 15:03
  • OK, so it is safe to call the function with `reinterpret_cast(std::utf16string.data())` once the `std::string` is converted? – Matt Clarkson Sep 25 '13 at 15:09
  • @MattClarkson: uhm, why would you need that `reinterpret_cast`? You should use a wide string class to store the UTF16 string and its `c_str()` method to get a constant pointer to its data. – Matteo Italia Sep 25 '13 at 15:15
  • @MatteoItalia +1 sorry, didn't quite understand all the codecvt things available to me :) Do you know if Windows expects little or big endian `UTF-16`? – Matt Clarkson Sep 25 '13 at 15:17
  • [Relevant stackoverflow answer](http://stackoverflow.com/questions/11040703/convert-unicode-to-char/11040983#11040983) – Matt Clarkson Sep 25 '13 at 15:19
  • @MattClarkson: little endian (in general on Windows everything - within experimental error - is little endian, due to the fact that x86 is little endian). – Matteo Italia Sep 25 '13 at 15:24

> The benefit of C++11 is that all char strings are UTF-8:

This is not specified by C++11 for normal string literals and you'll find VC++ doesn't make it so. If you want UTF-8 strings then you have to ensure that yourself.

> My understanding is that the wide version takes a 16-bit wchar_t, which in Windows land is UTF-16, and the ANSI version is ASCII.

The `*A` functions always use the system code page, which is an extended version of ASCII (and is never UTF-8).

> Is the correct way to use this function to convert the std::string into UTF-16 (with std::codecvt_utf8_utf16 or something) and then pass it into the wide version of the function?

If you have ensured that your strings are UTF-8 (which is a good idea, IMO) then converting to UTF-16 and using the wchar_t version is the correct thing to do.

#include <Windows.h>
#include <codecvt>
#include <locale>
#include <string>

int main() {
  // Converts between UTF-8 (std::string) and UTF-16 (std::wstring on Windows).
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

  // UTF-8 encoded byte strings.
  std::string var = "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0"; // кошка
  std::string val = "\xE6\x97\xA5\xE6\x9C\xAC\xE5\x9B\xBD";     // 日本国

  // from_bytes() yields a std::wstring, so c_str() gives the wchar_t* the
  // wide API expects -- no reinterpret_cast needed.
  SetEnvironmentVariableW(convert.from_bytes(var).c_str(),
                          convert.from_bytes(val).c_str());
}
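
Putting the pieces together, a sketch of the cross-platform environment::Set wrapper the question describes might look like the following (the bool return value and the POSIX setenv branch are assumptions, not part of this answer):

#include <string>
#ifdef _WIN32
#include <Windows.h>
#include <codecvt>
#include <locale>
#else
#include <cstdlib>
#endif

namespace environment {

// Hypothetical wrapper: takes UTF-8 std::strings and sets the variable using
// the platform-appropriate API. Returns true on success.
bool Set(const std::string& name, const std::string& value) {
#ifdef _WIN32
  // Convert the UTF-8 inputs to UTF-16 and call the wide Win32 API.
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
  return SetEnvironmentVariableW(convert.from_bytes(name).c_str(),
                                 convert.from_bytes(value).c_str()) != 0;
#else
  // POSIX setenv already takes narrow (conventionally UTF-8) strings.
  return setenv(name.c_str(), value.c_str(), 1) == 0;
#endif
}

} // namespace environment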

With full C++11 conformance we could write `std::string var = u8"кошка";`; however, VC++ doesn't implement this, and it appears to be a very low priority item since it doesn't appear explicitly on their roadmap to C++14 conformance.

Alternatively you can write `std::string var = "кошка";` if you save your source code as "UTF-8 without BOM". Be aware that this method has caveats; for example, you can't use wchar_t literals.

bames53
  • Good answer, with great example. – Matt Clarkson Sep 25 '13 at 16:16
  • I thought the default encoding for `codecvt_utf8_utf16` was UTF-16BE - and Windows demands `LE`? No need to specify the endian type here? – thomthom Dec 14 '15 at 15:59
  • @thomthom The default mode is big-endian, but that doesn't matter here. – bames53 Dec 14 '15 at 17:06
  • @bames53 - why does it not matter here when calling a Win32 API function? – thomthom Dec 14 '15 at 18:28
  • @thomthom It doesn't matter here when doing the conversion, because the codecvt_mode parameter pertains to the external encoding, which in the case of `codecvt_utf8_utf16` is UTF-8, for which endianness is irrelevant. The endianness used for the internal encoding is never affected by the codecvt_mode parameter. – bames53 Dec 14 '15 at 18:36