Which Unicode encoding does the Linux kernel use?

Question

I have learned that Windows uses UTF-16LE on x86/x64 systems. What about Linux? Which Unicode encoding does it use: UTF-16LE or UTF-32?

What makes you think Linux favors any particular encoding? Are you asking whether common Linux distributions assume that configuration files are encoded using a particular encoding or whether syscalls assume that inputs are strings of code-points encoded using a particular encoding? — Mike Samuel, Apr 07 '12 at 03:50
Why do you mention the processor architecture? Are you under the impression that the architecture for which you compile Linux affects the encoding beyond endianness? — Mike Samuel, Apr 07 '12 at 03:52
@Mike Samuel: I am asking which encoding syscalls assume. — Jichao, Apr 07 '12 at 03:52
Somewhat related to [Why is it that UTF-8 encoding is used when interacting with a Unix or Linux Environment](http://stackoverflow.com/questions/164430/why-is-it-that-utf-8-encoding-is-used-when-interacting-with-a-unix-linux-environ/). — Jonathan Leffler, Apr 07 '12 at 03:59
[UTF-32](http://www.unicode.org/faq//utf_bom.html) comes in BE and LE forms too. — Jonathan Leffler, Apr 07 '12 at 04:09

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

http://www.xsquawkbox.net/xpsdk/mediawiki/Unicode says

> Linux
>
> On Linux, UTF8 is the 'native' encoding for all strings, and is the format accepted by system routines like fopen().

so Linux is like Plan 9 in that respect, and the question "boost::filesystem and Unicode under Linux and Windows" notes

> It looks to me like boost::filesystem under Linux does not provide a wide character string in path::native(), despite boost::filesystem::path having been initialized with a wide string.

which would rule out UTF-16 and UTF-32, since all variants of those require wide-character support: their encoded strings can contain NUL bytes, which APIs built on NUL-terminated byte strings cannot carry.
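The NUL-byte argument can be checked directly. The following is a minimal Python sketch, not part of the original answer: the file name `data.txt` and the helpers `contains_nul` and `c_string_view` are hypothetical names chosen for illustration. It encodes one name three ways and shows what a routine that stops at the first zero byte would see.

```python
# Encode one illustrative file name in three Unicode encodings and check
# whether the bytes could pass through a NUL-terminated byte-string API.
name = "data.txt"  # hypothetical name, chosen only for this sketch


def contains_nul(encoded: bytes) -> bool:
    """Return True if the encoded bytes include a zero byte."""
    return b"\x00" in encoded


def c_string_view(encoded: bytes) -> bytes:
    """Return the prefix a C routine would see when it stops at the first NUL."""
    return encoded.split(b"\x00", 1)[0]


utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")
utf32 = name.encode("utf-32-le")

for label, encoded in (("UTF-8", utf8), ("UTF-16LE", utf16), ("UTF-32LE", utf32)):
    print(label, contains_nul(encoded), c_string_view(encoded))
```

For ASCII text, UTF-8 yields no zero bytes, while UTF-16LE and UTF-32LE pad every character with them, so a NUL-terminated reader would stop after the first character.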

answered Apr 07 '12 at 04:01 by Mike Samuel · edited Jun 20 '20 at 09:12 by Community

I thought Linux only sees paths as bytes, not as UTF-8 characters. – Melab Sep 19 '17 at 21:18

@melab, [lwn](https://lwn.net/Articles/71472/) tends to agree that the kernel is charset-agnostic and treats paths as nul-terminated byte arrays. One ignores userspace conventions at one's peril though. – Mike Samuel Sep 19 '17 at 21:25
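One way to probe the "paths are just bytes" view from userspace is to create a file whose name is not valid UTF-8 and read the directory back as bytes. This is a rough Python sketch under assumptions: the helper `roundtrip_raw_name` and the sample name are hypothetical, and the outcome depends on the filesystem. A typical Linux filesystem such as ext4 or tmpfs is expected to accept the name, while other platforms may reject it.

```python
import os
import tempfile


def roundtrip_raw_name(raw_name: bytes) -> bool:
    """Create a file named raw_name in a temporary directory and report
    whether the directory listing returns exactly the same bytes.
    Returns False if the filesystem refuses the name."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)  # work with bytes paths throughout
        path = os.path.join(tmp_bytes, raw_name)
        try:
            with open(path, "wb"):
                pass  # an empty file is enough for this check
        except OSError:
            return False  # this filesystem rejected the byte sequence
        return os.listdir(tmp_bytes) == [raw_name]


# Latin-1 bytes for "café.txt"; the lone 0xE9 byte is not valid UTF-8.
invalid_utf8 = b"caf\xe9.txt"

if __name__ == "__main__":
    print(roundtrip_raw_name(invalid_utf8))
```

If the call returns `True`, the name passed through the kernel without being interpreted as text. Tools that assume UTF-8 may still display such a name badly, which is the userspace convention the comment above warns about.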

score 3 · Answer 2 · answered Apr 07 '12 at 03:50

Generally, Unix prefers UTF-8. This document suggests that the Linux kernel does too.

answered Apr 07 '12 at 03:50 by MK.
