
As I understand it, the use of the Windows filesystem functions that accept byte strings (those with an A suffix) is discouraged (I didn't find an official deprecation notice, but Python, for example, deprecated their use).

On Unix-derived systems, file names are stored as bytes. The encoding isn’t defined, but many systems are configured to interpret file names as UTF-8.

Recently it has become possible on Windows to set the system code page to UTF-8. Is it possible to estimate how many Windows users have that code page set? Does it make sense to use the bytes-accepting filesystem API on Windows in the same way the POSIX API is used on Unix-derived systems (e.g. when porting an application from Linux to Windows)?

Manuel Jacob
  • To the people who downvoted the question: I know that this question may be hard to ask in a good way. Can you make suggestions on how to improve the question? – Manuel Jacob Aug 06 '19 at 20:54
  • It is impossible to answer the "how many" part. The other question is subjective. But the answer is surely that it does not make sense to use the 8-bit API. Whilst it might be convenient to use UTF-8 everywhere, Windows isn't ready for it. – David Heffernan Aug 06 '19 at 21:09
  • @DavidHeffernan On Linux, it's also impossible to estimate how many people are using UTF-8 as their system encoding. Still, nowadays most people seem to assume that everyone is using UTF-8. I deliberately didn't ask how many people have that code page set; I asked whether it's possible to make an estimate at all. – Manuel Jacob Aug 06 '19 at 21:13
  • @DavidHeffernan So your answer was "no, it’s not possible to estimate how many" and "Windows is not ready for UTF-8 everywhere". Do you think it’s possible to restate my question in a way that this could be posted as an answer? – Manuel Jacob Aug 06 '19 at 21:18
  • No. I don't think there is any way to ask this question without it attracting valid close votes. – David Heffernan Aug 06 '19 at 21:24
  • One piece of advice: forget everything you know from Unix when trying Windows programming. It's *way* different. Forget non-Unicode *completely*. – Michael Chourdakis Aug 06 '19 at 22:55
  • @DavidHeffernan there is no convenience in using UTF-8 everywhere. In fact, UTF-8 is only used in a small part of programming. – Michael Chourdakis Aug 06 '19 at 22:57

1 Answer


Only the newest versions of Windows 10 allow you to set the default code page to UTF-8. This off-by-default feature is marked as Beta and has a couple of compatibility issues, so I'm guessing the percentage is low.

The actual number is irrelevant because CreateFileA is just going to convert the string and call CreateFileW no matter what, and in the end, a UTF-16LE filename is going to be sent to the kernel. NTFS stores filename characters as 16-bit values and Windows interprets them as UTF-16LE strings (but without validating them AFAIK).
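A minimal sketch of what this implies in practice when porting code that carries file names as UTF-8 byte strings: convert the name to UTF-16 yourself with `MultiByteToWideChar` and call `CreateFileW` directly, instead of going through `CreateFileA` and the process ANSI code page. The helper name `OpenUtf8Path` is made up for the example.

```cpp
#include <windows.h>
#include <string>

// Open a file whose name is held as a UTF-8 byte string, without
// depending on the ANSI code page that CreateFileA would use.
HANDLE OpenUtf8Path(const std::string &utf8Path)
{
    // Ask how many UTF-16 code units the converted name needs
    // (-1 means "including the terminating null").
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8Path.c_str(), -1, nullptr, 0);
    if (wlen == 0)
        return INVALID_HANDLE_VALUE;  // invalid UTF-8 or other error

    std::wstring widePath(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8Path.c_str(), -1, &widePath[0], wlen);

    // CreateFileW takes the UTF-16 name directly; this is what
    // CreateFileA would end up calling anyway.
    return CreateFileW(widePath.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
```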

Anders
  • Thank you for your useful answer. Do you have a source saying that it's a beta feature? Are the compatibility issues documented somewhere? – Manuel Jacob Aug 06 '19 at 22:56
  • The UI itself said BETA, but I'm not sure when/if that has been removed. See https://stackoverflow.com/q/56419639/3501 for at least one issue. – Anders Aug 06 '19 at 23:21
  • 1
    One problem with setting the system locale to UTF-8 (codepage 65001) is that the system OEM codepage is used as the default for the console (conhost.exe). Setting the console's input codepage to UTF-8 limits reading input via `ReadFile` or `ReadConsoleA` to 7-bit ASCII since internally it doesn't support encoding to a multibyte encoding, and UTF-8 is 2-4 bytes per non-ASCII code point. It calls `WideCharToMultiByte` with a fixed internal buffer that assumes 1 byte per code point (or 2 bytes per code point for DBCS codepages). Each non-ASCII character is replaced by null in the result. – Eryk Sun Aug 07 '19 at 14:59
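A minimal sketch that exercises the scenario in the last comment, assuming a console attached to the process: switch the input code page to UTF-8 with `SetConsoleCP` and dump the raw bytes that `ReadFile` returns. Per the comment above, on affected console hosts each non-ASCII character in the typed line comes back as a null byte.

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Switch the console input code page to UTF-8 (65001).
    SetConsoleCP(CP_UTF8);

    char buf[256] = {0};
    DWORD bytesRead = 0;
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);

    // Type a line containing non-ASCII characters and press Enter,
    // then inspect the raw bytes the console delivered.
    if (ReadFile(in, buf, sizeof(buf) - 1, &bytesRead, nullptr)) {
        for (DWORD i = 0; i < bytesRead; ++i)
            std::printf("%02X ", static_cast<unsigned char>(buf[i]));
        std::printf("\n");
    }
    return 0;
}
```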