17

I know that C strings are char[] with a '\0' in the last element. But how are the chars encoded?

Update: I found this cool link which talks about many other programming languages and their encoding conventions: Link

Plumenator
  • 1,682
  • 3
  • 20
  • 49

5 Answers

9

All the standard says on the matter is that you get at least the 52 upper- and lower-case Latin alphabet characters, the digits 0 to 9, the symbols ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~, the space character, and control characters representing horizontal tab, vertical tab, form feed, alert, backspace, carriage return, and new line.

The only thing it says about numeric encoding is that all of the above fits in one byte, and that the value of each digit after zero is one greater than the value of the previous one.

The actual encoding is probably inherited from your locale settings. Probably something ASCII-compatible.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
Nietzche-jou
  • 14,415
  • 4
  • 34
  • 45
  • 2
    I guess locale is also configurable in the compiler. Just found out about gcc's -finput-charset option (http://gcc.gnu.org/onlinedocs/cpp/Invocation.html). The default seems to be UTF8. No wonder I was able to print UTF8Strings. – Plumenator Oct 22 '10 at 11:15
  • Does the standard also say anything about the ordinal values of alphabets? – Plumenator Oct 22 '10 at 11:25
  • @Plumenator: No. There is not even a requirement that `'A' < 'B'`. – Bart van Ingen Schenau Oct 22 '10 at 13:29
  • 2
@Plumenator: The only guarantee about `strcmp` is that the output value corresponds to the numeric values of the characters in the string. It says nothing about how the result maps to the alphabet. – Oliver Charlesworth Oct 22 '10 at 13:38
8

A C string is pretty much just a sequence of bytes. That means it does not have a well-defined encoding; it could be ASCII, UTF-8, or anything else, for that matter. Because most operating systems understand ASCII by default, and source code is mostly written in an ASCII-compatible encoding, the data you find in a plain (char*) will very often be ASCII as well. Nonetheless, there is no guarantee: what you get out of a (char*) might be UTF-8, or it might just as well be KOI8.

fresskoma
  • 25,481
  • 10
  • 85
  • 128
  • Actually most modern OS use a wide character string in all internal interfaces (Win/Linux/Mac). So it is not ASCII they use. – Martin York Oct 22 '10 at 11:06
I didn't say that they use ASCII by default in their interfaces, but that they understand ASCII :) – fresskoma Oct 22 '10 at 11:10
  • 2
    "it does not really have any encoding" Digitally stored text always has some encoding. – Praxeolitic Nov 29 '17 at 06:00
  • @MartinYork Linux absolutely doesn't use wide characters internally. POSIX interfaces are byte oriented and encoding agnostic. MacOS is also POSIX with BSD heritage, I'd expect it to use byte-encoding internally too. – Yakov Galka Oct 28 '21 at 01:27
7

The standard does not specify this. Typically it's ASCII.

Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
In Objective-C I'm able to create C strings by saying char *cStr = [objcStr UTF8String], and print them with printf("%s", cStr). Does it work because ASCII is a subset of UTF8? – Plumenator Oct 22 '10 at 10:54
  • Yes, ASCII is a subset of UTF8. – fresskoma Oct 22 '10 at 10:58
  • 3
@Plumenator It works because UTF-8 was designed to be as transparent as possible to code already handling ASCII, and because your output terminal supports UTF-8 – nos Oct 22 '10 at 10:59
  • +1 @nos, but to fill in some details, it works because UTF-8 guarantees that the zero byte doesn't occur in any multibyte character encoding, so `printf` will never inadvertently deliver just part of a UTF-8-encoded string to the terminal. – Marcelo Cantos Oct 22 '10 at 11:15
1

They are not really "encoded" as such, they are simply stored as-is. The string "hello" represents an array with the char values 'h', 'e', 'l', 'l', 'o' and '\0', in that order. The C standard has a basic character set that includes these characters, but doesn't specify an encoding into bytes. It could be EBCDIC, for all you know.

Marcelo Cantos
  • 181,030
  • 38
  • 327
  • 365
1

As others have indicated already, C places some restrictions on what is permitted for the source and execution character encodings, but it is relatively permissive. So in particular it is not necessarily ASCII, though nowadays it is in most cases at least an extension of ASCII.

Your implementation is meant to translate between the source and execution character sets where they differ. So generally you should not care about the encoding; on the contrary, try to code independently of it. This is why there are escape sequences for special characters like '\n' or '\t', and universal character names like '\u0386'. So usually you shouldn't have to look up the encodings of the execution character set yourself.

Jens Gustedt
  • 76,821
  • 6
  • 102
  • 177