What determines how strings are encoded in memory?

Question

Say we have a file that is Latin-1 encoded and that we use a text editor to read in that file into memory. My questions are then:

How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?

As a follow-up question:

How do applications then save files to encoding schemes that use different character sets? F.e. converting UTF-8 to UTF-16 seems fairly intuitive to me as I assume you just decode to the Unicode codepoint, then encode to the target encoding. But what about going from UTF-8 to Shift-JIS which has a different character set?

It's definitely the application, with strong influences from the language and OS. Windows for example works with UTF-16, so an app will be easier if it uses it too. — Mark Ransom, Oct 26 '22 at 16:54
Converting from one encoding to another is a complex enough process that you'll want to find a library or OS call to do it for you. — Mark Ransom, Oct 26 '22 at 16:55
Not sure what your exact use case is with `Shift-JIS`, but [this GitHub comment](https://github.com/php/php-src/issues/8281#issuecomment-1086672866) states _"...converting from Unicode → JISX 0201/0208 → Unicode is not a lossless conversion"_ (which you probably already know). But that comment also describes possible workarounds. — skomisa, Nov 01 '22 at 06:07

score 1 · Accepted Answer · answered Nov 02 '22 at 22:40

Operating system

Windows
- 1993: Windows adopted Unicode 1.0 with NT 3.1 - back then Unicode was what is nowadays known as UCS-2. That Windows version also introduced NTFS (New Technology File System), which also stores every filename in UCS-2 like manner (16 bit codepoints).
- 2000: With NT 5.0 (aka Windows 2000) there was a shift/improvement from UCS-2 to UTF-16 - both OS and encoding became available in this year.
Since then nothing has changed. Internally, Windows uses 16 bit codepoints for almost 30 years already, and thanks to UTF-16 also newest codepoints such as Emojis are supported. Its API works the same way, with compatibility functions for byte-wise encodings merely being stubs that convert the input to UTF-16. See also
Unix: most distributions use UTF-8 by default, because it's most backward compatible while being future proof enough.

Programming language

Depends on their age or on their compiler: while languages themselves are not necessarily bound to an OS the compiler which produces the binaries might treat things differently as per OS.

Pascal: based in 1970 the String was just an array of bytes, not even necessarily meaning text. And for text ASCII or one of the other single-byte encodings could easily be dealt with.
Delphi: adopted as per Windows WideString, dealing with 16 bit per character, to perfectly make use of the WinAPI and its Unicode support. Later additions also emerged the UTF8String, which works with bytes again, but not necessarily only one byte per character. But also creations such as UCS4String are available since 2009, eating 4 bytes per character.
Free Pascal: stays with the old String but always defaults to UTF-8 encoding. While this always needs conversion when using the WinAPI it is also more platform independent. Several other String (compatibilty) types also exist, each with different memory usage.
ECMAScript (JavaScript): as per standard an engine should use UTF-16 for texts. See also JavaScript strings - UTF-16 vs UCS-2?
Java: engines must support a minimum of encodings, including UTF-16, thus internal String handling/memory usage may differ. See also What is the Java's internal represention for String? Modified UTF-8? UTF-16?

Application/program

Depends on the platform/OS. While the in-memory consumption of text is strongly influenced by the programming language compiler and the data types used there, using libraries (which could have been produced by entirely other compilers and programming languages) can mix this.

Strictly speaking the binary file format also has its strict encodings: on Windows the PE (used in EXE, DLL, etc.) has resource Strings in 16 bit characters again. So while f.e. the Free Pascal Compiler can (as per language) make heavy use of UTF-8 it will still build an EXE file with UTF-16 metadata in it.

Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, surely with compromises such as temporarily duplicating parts into Strings of 32 bit per character, just to quickly search through it, let alone supporting Unicode normalization.

Conversion

The most common approach is to use a common denominator:

Either every input is decoded into 32 bit characters which are then encoded into the target. Costs the most memory, but makes it easy to deal with.
In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd make a sidestep from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Support for all the encodings shift as per version and localized installation, there's not really a guarantee for all of them.
External libraries specialized on encodings alone can do this, like iconv - these support many encodings unbound to the OS support.

Your information about Delphi is wrong. Its `WideString` was originally introduced to support Windows ActiveX/COM APIs, but by extension also Win32 UCS-2/UTF-16 APIs as well. But its `String` type was *never* `WideString`. It was `AnsiString` (which was originally like the original Pascal `String`, holding arbitrary bytes) until 2009 when `String` became `UnicodeString` instead (not `WideString`), and `AnsiString` gained support for codepages (and thus `UTF8String` became a true UTF-8 type). — Remy Lebeau, Nov 11 '22 at 00:04
I disagree: neither did I wrote that `WideString` and `String` have something in common, nor did I wrote that `UTF8String` emerged out of `WideString`. I also didn't exclude `String` from existence. Or do you think my formulation can be misunderstood? — AmigoJack, Nov 11 '22 at 02:11
Your description of Delphi didn't mention AnsiString at all, even though it was Delphi's default string type for a long time. But you described the byte strings of Pascal and FreePascal, so why not for Delphi? Just saying, there's some inconsistency in your answer regarding Unicode in various string types. And I wasn't implying that you said UTF8String emerged from WideString. It actually emerged from AnsiString. — Remy Lebeau, Nov 11 '22 at 03:19
I didn't mention everything, since the main difference of Pascal vs. Delphi is Unicode availability (not counting Delphi 1, but who does so anyway), which is the topic here. And my link to the PDF would explain all the other different `String` types. It's okay if you edit that part to what you think would be a better description - my intention was also to make it not too long/detailed - maybe you can even link to a better overview of how all the types evolved over time. — AmigoJack, Nov 11 '22 at 08:39

What determines how strings are encoded in memory?

1 Answers1

Operating system

Programming language

Application/program

Conversion