
I have a bunch of strings in my program. I don't know how these strings are encoded, or what determines the encoding. Everything I read about this in C++ simply states that char and string are encoding-agnostic and don't care what you put into them. However, it would be very useful to me to know this.

When I write std::string s = "some text", is the text stored as UTF-8/16/32, ASCII, extended ASCII, or some other encoding? Is this determined by the compiler? Or the target OS? The target hardware?

What factors in the implementation determine how the string gets encoded, and how can I detect them?

Does the encoding of the source file matter? If all my source files are UTF-8, can I guarantee that gcc/g++ will always encode strings as UTF-8?

alrav
  • Check the reference [here](https://en.cppreference.com/w/cpp/string/basic_string) – francesco Dec 16 '22 at 20:01
  • Corresponding C question: [GCC 4.7 Source Character Encoding and Execution Character Encoding For String Literals?](https://stackoverflow.com/q/12216946/12149471) Most of this information probably also applies to C++. – Andreas Wenzel Dec 16 '22 at 20:07
  • @francesco from what I'm seeing it just talks about character widths, not the encoding scheme that gets utilized for literals. Unless I'm just missing where it specifies that? – alrav Dec 16 '22 at 20:08
  • `std::string` is a container. It can hold anything, but if it doesn't hold the implementation's standard `char` encoding, you cannot take full advantage of all that `string` offers. What the standard encoding is can vary from implementation to implementation, but almost universally it's ASCII. – user4581301 Dec 16 '22 at 20:10
  • @user4581301 I'm aware it's merely a container that varies from implementation to implementation. I was hoping for some way to go down a level of abstraction, with some system call, or a macro definition, or something that specifies the implementation, rather than simply assuming it's always ASCII – alrav Dec 16 '22 at 20:12
  • [This](https://en.cppreference.com/w/cpp/language/types) is perhaps useful, see in particular "Character types". – francesco Dec 16 '22 at 20:20
  • Get an upvote from me. I currently have the same understanding issues. I already figured out (using Microsoft C++) that it depends on the encoding of the source code file. You may get different results, depending on how you save the file. Try ASCII, UTF8 with BOM and other Encodings. Also, don't only use ASCII characters in the string. What I'm investigating next is, whether the code page used while compiling the source file also matters. It'll take some time for me to figure that out – Thomas Weller Dec 16 '22 at 20:22
  • I don't have a good, cross-platform way to get the default encoding. [`setlocale`](https://en.cppreference.com/w/cpp/locale/setlocale) will get the current encoding, but if you haven't set the locale yourself, you get "C". – user4581301 Dec 16 '22 at 20:25
  • Also note that this is actually quite hard to test. There are multiple sources of errors: 1. the console (there might be differences between PowerShell and cmd, for example) 2. the font (not all fonts support the same characters) 3. the code page that is active when running the program 4. (maybe) the language of your OS (I installed a Japanese VM to do tests) 5. as mentioned before: encoding of your source file 6. as mentioned before: (maybe) the code page used during compilation 7. the debugger may display something other than what is printed on the console 8. other programs (e.g. Excel) – Thomas Weller Dec 16 '22 at 20:30
  • @ThomasWeller It's great to see other folks looking into this. I was disappointed when I initially didn't find any obvious answers beyond "implementation defined". I followed most of your comments, but could you elaborate a bit on what a "code page" is, or link a good reference? An initial Google search suggests it may just be a Microsoft way of defining encoding schemes... but I'm unsure and am more familiar with merely cross compiling for Linux dev. – alrav Dec 16 '22 at 20:39
  • I just started to learn about code pages myself. Code pages are the thing between ASCII and Unicode, from a time when people had realized that ASCII is not enough and Unicode was not invented yet. Since ASCII was 7 bit, they took the 8th bit and defined 128 additional characters for their purposes. There are many code pages and each one has different characters, depending on the language it was supposed to represent. See https://en.wikipedia.org/wiki/Code_page for a start. Crux is that even "code page" is not a great term, since you have DOS, Windows, IBM, Macintosh, ... code pages. – Thomas Weller Dec 16 '22 at 20:44
  • TLDR: No, you have to tell the compiler/OS what encoding a character string uses. Directly, or indirectly, somehow. There is no universal rule that, given some randomly-generated 8-bit values, magically determines what encoding they use, if they are text at all. A sequence of 8-bit values is a sequence of 8-bit values. The End. Some operating systems might have functions that use heuristics to guess the encoding. At most, this will always be a guess. – Sam Varshavchik Dec 16 '22 at 20:44
  • I don't know if Linux has the same issue with code pages. – Thomas Weller Dec 16 '22 at 20:44
  • See [this page](https://en.cppreference.com/w/cpp/language/charset) especially the "Code unit and literal encoding" section. – Miles Budnek Dec 16 '22 at 20:45
  • `std::string s = "my string"` `s` will be in whatever encoding was used when compiling the C++ file (if the file was saved in utf-8 it will be utf-8). Then you need to interpret this encoding. If your system uses, say, CP1251, you'll see garbage (unless you explicitly require utf-8 with setlocale etc.). – Osyotr Dec 16 '22 at 20:45
  • @Osyotr Not necessarily. Even if the source file is encoded in utf-8 the actual string will be translated to the execution character set by the compiler (assuming that isn't also utf-8). – Miles Budnek Dec 16 '22 at 20:47
  • @MilesBudnek I was about to write about source/execution charsets. IMHO, using something other than utf-8 as execution charset is bogus. – Osyotr Dec 16 '22 at 20:49
  • Ah, ok, so it seems code pages are basically definitions for what I think of as extended ASCII sets: https://www.praim.com/en/news/character-encodings-in-linux-ascii-utf-8-and-iso-8859/ That sheds some more light on things... it looks like for the most part on Linux only ASCII was supported until the jump was made to UTF-8... though there were many proprietary encodings that might have been used, they are considered non-standard, it seems... I think. At least on the Linux side. – alrav Dec 16 '22 at 21:33
