3

I have learned that Windows uses UTF-16LE on x86/x64 systems. What about Linux? Which Unicode encoding does it use: UTF-16LE or UTF-32?

Jonathan Leffler
Jichao
  • 1
    What makes you think Linux favors any particular encoding? Are you asking whether common Linux distributions assume that configuration files are encoded using a particular encoding or whether syscalls assume that inputs are strings of code-points encoded using a particular encoding? – Mike Samuel Apr 07 '12 at 03:50
  • Why do you mention the processor architecture? Are you under the impression that the architecture for which you compile Linux affects the encoding beyond endianness? – Mike Samuel Apr 07 '12 at 03:52
  • 1
    @Mike Samuel: I am asking which encoding syscalls assume. – Jichao Apr 07 '12 at 03:52
  • 2
    Somewhat related to [Why is it that UTF-8 encoding is used when interacting with a Unix or Linux Environment](http://stackoverflow.com/questions/164430/why-is-it-that-utf-8-encoding-is-used-when-interacting-with-a-unix-linux-environ/). – Jonathan Leffler Apr 07 '12 at 03:59
  • 1
    [UTF-32](http://www.unicode.org/faq//utf_bom.html) comes in BE and LE forms too. – Jonathan Leffler Apr 07 '12 at 04:09

2 Answers

4

http://www.xsquawkbox.net/xpsdk/mediawiki/Unicode says

Linux

On Linux, UTF8 is the 'native' encoding for all strings, and is the format accepted by system routines like fopen().

so Linux is like Plan 9 in that respect, and the question "boost::filesystem and Unicode under Linux and Windows" notes

It looks to me like boost::filesystem under Linux does not provide a wide character string in path::native(), despite boost::filesystem::path having been initialized with a wide string.

which would rule out UTF-16 and UTF-32, since every variant of those encodings needs wide-character support: their encoded strings can contain NUL bytes, which byte-oriented routines treat as the end of the string.
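To make the NUL-byte point concrete, here is a minimal sketch in Python (not from the linked pages; the file name `abc.txt` and the helper `contains_nul` are made up for illustration). It encodes an ASCII-only name three ways and checks which results contain a `0x00` byte:

```python
# Encode a short ASCII-only file name in three Unicode encodings and
# check which of them put NUL (0x00) bytes into the resulting byte string.
name = "abc.txt"  # hypothetical file name, chosen only for illustration


def contains_nul(text: str, encoding: str) -> bool:
    """Return True if encoding `text` yields at least one 0x00 byte."""
    return b"\x00" in text.encode(encoding)


results = {
    enc: contains_nul(name, enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
print(results)
# {'utf-8': False, 'utf-16-le': True, 'utf-32-le': True}
```

A routine that takes a NUL-terminated `char *` would see the UTF-16LE form of this name end after its first byte, which is why only UTF-8 fits such interfaces unchanged.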

Mike Samuel
  • I thought Linux only sees paths as bytes, not as UTF-8 characters. – Melab Sep 19 '17 at 21:18
  • 1
    @melab, [lwn](https://lwn.net/Articles/71472/) tends to agree that the kernel is charset-agnostic and treats paths as nul-terminated byte arrays. One ignores userspace conventions at one's peril though. – Mike Samuel Sep 19 '17 at 21:25
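The "paths are byte arrays" view from these comments can be sketched from userspace as follows. This is a rough illustration, assuming a typical Linux file system that stores names verbatim; the file name is made up, and UTF-8 is the userspace convention here, not something the kernel checks:

```python
import os
import tempfile

# Hypothetical file name; UTF-8 is chosen by us, not enforced by the kernel.
name = "日本語.txt"
raw_name = name.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.fsencode(tmp)              # directory path as bytes
    path = os.path.join(raw_dir, raw_name)  # bytes in, bytes out
    # Create the file by passing a raw byte path to the open() wrapper.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    # Listing with a bytes argument returns the stored names as bytes.
    listed = os.listdir(raw_dir)

print(listed)
```

The name comes back as exactly the bytes that were passed in; interpreting them as UTF-8 is left to whatever program reads the directory.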
3

Generally, Unix prefers UTF-8. This document suggests that the Linux kernel does too.

MK.