I'm using the new C++ <filesystem> library, and all the strings representing file and folder names are returned as const wchar_t* pointers. In my program I use char and std::string, which is fine as far as I know, because all file paths can be written with ASCII characters (only one byte each). There is a way to convert from std::wstring to std::string, but the <codecvt> header used for this is deprecated as of C++17.

I was wondering: what would be the harm in simply copying the value of each 2-byte wchar_t into a single-byte char? Whenever you have to convert from std::wstring (I'm assuming UTF-16) to a single-byte std::string, any characters greater than 127 or 255 in the std::wstring can't be contained in the char, so they can't be converted in any way, right?
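
To make the idea concrete, here is a minimal sketch of the per-character narrowing I have in mind. The names `narrow_by_truncation` and `is_ascii` are hypothetical helpers written for this question, not part of <filesystem> or any library. As far as I can tell, the copy is lossless only when every character is in the 7-bit ASCII range; anything else silently turns into a different byte.

```cpp
#include <string>

// Naive narrowing: keep only the low 8 bits of each wide character.
// Hypothetical helper illustrating the approach asked about above.
std::string narrow_by_truncation(const std::wstring& wide) {
    std::string out;
    out.reserve(wide.size());
    for (wchar_t wc : wide) {
        // Everything above the low byte is discarded here.
        out.push_back(static_cast<char>(wc & 0xFF));
    }
    return out;
}

// Returns true only if every character is 7-bit ASCII (U+0000..U+007F),
// the one case where the truncation above loses no information.
bool is_ascii(const std::wstring& wide) {
    for (wchar_t wc : wide) {
        if (static_cast<unsigned long>(wc) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

For example, `narrow_by_truncation(L"\u0141")` (the letter Ł, U+0141) produces the single byte `0x41`, which is a plain `A`, so a path containing that letter would quietly become a different, still valid-looking path. Checking `is_ascii` first would at least detect that case instead of hiding it.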

Remy Lebeau
Zebrafish
  • what about https://en.cppreference.com/w/cpp/filesystem/path/string – asmmo Jun 19 '20 at 21:59
  • `char` strings can use multi-char sequences to represent values that do not fit into a single `char`. Also you should read [this](http://utf8everywhere.org) –  Jun 19 '20 at 22:15
  • _"all filepaths can be written with ASCII characters (only one byte)"_ Who says? – Asteroids With Wings Jun 19 '20 at 22:15
  • _"any characters greater than 127 or 255 in the std::wstring can't be contained in the char, so they can't be converted in any way, right?"_ Sounds like you need to do some more reading on string encodings!! – Asteroids With Wings Jun 19 '20 at 22:16
  • @AsteroidsWithWings You can store a character with a value of more than 255 in one byte? Also which characters in a filepath aren't covered by ASCII? – Zebrafish Jun 19 '20 at 22:36
  • Recommended reading: https://en.wikipedia.org/wiki/UTF-8 – Paul Sanders Jun 19 '20 at 22:39
  • @Zebrafish "*You can store a character with a value of more than 255 in one byte?*" - there are single-byte character encodings, like Windows-125x, ISO-8859-x, etc., that have single-byte *representations* of Unicode characters higher than 127. For example, byte `0x80` in Windows-1252 represents Unicode character U+20AC. There are also many multi-byte character encodings available (UTFs, Shift-JIS, etc). These are commonly referred to as *charsets*, which are implemented as *code pages* on Windows. – Remy Lebeau Jun 19 '20 at 22:52
  • @Zebrafish "*Also which characters in a filepath aren't covered by ASCII?*" - standard ASCII covers only Unicode characters U+0000..U+007F in bytes `0x00..0x7F` (0..127). However, there are "Extended ASCII" encodings that can cover higher Unicode characters in bytes `0x80-0xFF`, though these are non-standardized. And then there are the single-byte encodings, which are commonly referred to as "ANSI" encodings, though they are not really defined by ANSI itself (which is a standards organization). – Remy Lebeau Jun 19 '20 at 22:57
  • @Zebrafish Have a look at [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). – Remy Lebeau Jun 19 '20 at 22:59
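
The points raised in the comments above can be sketched in code. This is a minimal illustration, not a definitive implementation: `encode_utf8` and `path_to_narrow` are hypothetical names chosen for this example. The first shows how a `char` string can hold a character above 255 as a multi-`char` UTF-8 sequence; the second wraps `std::filesystem::path::string()`, the member linked in the first comment, which returns the path in the platform's native narrow encoding. What that conversion does with characters the narrow encoding cannot represent is platform-dependent, so treat that part as an assumption to verify on your target. The encoder does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <filesystem>
#include <string>

// Encode one Unicode code point as UTF-8, using one to four char values.
// Sketch only: no validation of surrogates or out-of-range input.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII maps to a single byte unchanged.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Let the library do the wide-to-narrow conversion instead of <codecvt>.
// path::string() returns the path in the native narrow encoding.
std::string path_to_narrow(const std::filesystem::path& p) {
    return p.string();
}
```

With this sketch, `encode_utf8(0x20AC)` yields the three bytes `E2 82 AC` for U+20AC, the same character that Windows-1252 stores as the single byte `0x80`. That is why "the value does not fit in one `char`" and "the character cannot be stored in a `char` string" are different statements.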

0 Answers