
In my program I have a std::string that contains text encoded using the "execution character set" (which is not guaranteed to be UTF-8 or even US-ASCII), and I want to convert that to a std::string that contains the same text, but encoded using UTF-8. How can I do that?

I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object? What function or constructor must I use?

I assume the standard library provides some means for doing this (somewhere, somehow), because the compiler itself must know about UTF-8 (to support UTF-8 string literals) and the execution character set.

Raedwald
  • I personally would look for some library such as [ICU](http://site.icu-project.org/). Maybe you get along with a more light-weight library as proposed [here](https://stackoverflow.com/questions/745536/small-open-source-unicode-library-for-c-c)? – Aconcagua Jun 26 '18 at 11:20
  • @Aconcagua To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that? – Raedwald Jun 26 '18 at 11:29
  • OS-dependent... I am not aware of any up-to-date Linux/BSD distribution that does not use UTF-8 as its native character set anyway, so you probably don't need to care... Windows: there is some API for this; I'd start searching at [GetUserDefaultLCID](https://learn.microsoft.com/en-us/windows/desktop/api/winnls/nf-winnls-getuserdefaultlcid)... Possibly one of the libraries even provides a suitable API. – Aconcagua Jun 26 '18 at 11:38
  • How to get the execution character encoding? Well, someone had to tell the compiler at build time. If they also built it into the program's data then you could know. – Tom Blodget Jun 26 '18 at 16:52

2 Answers


> I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object?

You can construct a std::codecvt object yourself only as a base class subobject (by deriving from it), because its destructor is protected. That said, std::codecvt<char, char, std::mbstate_t> is not the facet you need: it represents the identity conversion (i.e. no conversion at all).
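
To see the identity behaviour for yourself, here is a minimal sketch. It borrows an existing facet instance from the classic locale with std::use_facet rather than deriving from the facet, and the helper name `char_codecvt_is_identity` is made up for this illustration:

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::use_facet, std::locale
#include <string>

// Hypothetical helper: runs `text` through codecvt<char, char, mbstate_t>
// and reports whether the facet says that it performs no conversion.
bool char_codecvt_is_identity(const std::string& text) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    // The facet is owned by the locale; use_facet only hands out a reference.
    const facet_t& cvt = std::use_facet<facet_t>(std::locale::classic());

    std::mbstate_t state{};
    std::string out(text.size() + 1, '\0');  // scratch output buffer
    const char* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r = cvt.out(
        state, text.data(), text.data() + text.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // For this specialization the standard specifies the result `noconv`.
    return cvt.always_noconv() && r == std::codecvt_base::noconv;
}
```

Whatever bytes you pass in, the facet reports `noconv`, so it cannot turn anything into UTF-8.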

At the moment, the C++ standard library has no functionality for converting between the native (aka execution) character encoding (aka character set) and UTF-8. You can therefore implement the conversion yourself using the Unicode standard: https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
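
As a rough sketch of what "implement it yourself" can look like: the UTF-8 encoding step is the same for every source encoding, while the step that maps source bytes to Unicode code points depends on what your execution character set actually is. The example below assumes, purely for illustration, that the source is ISO-8859-1 (where each byte value equals its code point); the function names are my own, not standard library names.

```cpp
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Returns false (and appends nothing) for surrogates and out-of-range values.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// ASSUMPTION: the input is ISO-8859-1, so every byte is its own code point.
// Any other execution character set needs its own byte-to-code-point mapping.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char byte : in) append_utf8(byte, out);
    return out;
}
```

For another single-byte encoding you would replace the body of `latin1_to_utf8` with a 256-entry lookup table; multi-byte encodings need a real decoder.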

> To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that?

There is no standard library function for that either. On a POSIX system, for example, you can use nl_langinfo(CODESET).
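
A minimal POSIX-only sketch of how that name could be fed to a converter, here iconv. Note the assumptions: nl_langinfo(CODESET) reports the encoding of the locale in effect at run time, which is not necessarily the execution character set the compiler used, and `locale_codeset` and `convert_to_utf8` are names I made up for this example. On some platforms iconv lives in a separate library, and a few declare its input parameter as `const char**`.

```cpp
#include <iconv.h>      // iconv_open, iconv, iconv_close (POSIX)
#include <langinfo.h>   // nl_langinfo, CODESET (POSIX)
#include <cerrno>
#include <clocale>
#include <cstddef>
#include <stdexcept>
#include <string>

// Name of the current locale's character encoding, e.g. "UTF-8".
// Side effect: switches LC_CTYPE to the user's environment locale.
std::string locale_codeset() {
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

// Converts `in` from the named encoding to UTF-8; throws on any failure.
std::string convert_to_utf8(const std::string& in, const std::string& from_codeset) {
    iconv_t cd = iconv_open("UTF-8", from_codeset.c_str());
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    char* src = const_cast<char*>(in.data());  // iconv does not modify the input
    std::size_t src_left = in.size();
    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        const int err = errno;                 // save before anything else runs
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means "buf is full": loop again with the remaining input.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return out;
}
```

A call would then look like `convert_to_utf8(text, locale_codeset())`.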

eerorika
  • Apart from identity, there are still UTF-X to UTF-Y and native wide to native narrow character set conversions. That does not change much: none of these is suitable for the job in question... – Aconcagua Jun 26 '18 at 11:31
  • @Aconcagua furthermore, the functionality to convert using those facets (`std::wstring_convert`) is deprecated. – eerorika Jun 26 '18 at 11:34

This is hacky, but it worked for me in MS VS2019:

#pragma execution_character_set( "utf-8" )
hack-tramp
  • I think it would be nicer to set the compiler flag directly (/execution-charset:utf-8). See: https://learn.microsoft.com/en-us/cpp/build/reference/execution-charset-set-execution-character-set?view=msvc-160 – Marc Aug 19 '21 at 01:17