0

I've been diving into C, low-level programming, and system design recently. As a seasoned Java developer I still remember my attempts to pass the Sun Java Certification and its questions about whether the char type in Java can be cast to Integer and how that can be done. What I know and remember is that numbers up to 255 can be treated either as numbers or as characters, depending on the cast.

Getting to know C, I want to understand more, but I find it hard to find a proper answer (I tried googling, but I usually get a gazillion results about how to simply convert char to int in code) to how EXACTLY it works that the C compiler/system calls transform a number into a character and vice versa.

AFAIK, what is stored in memory is numbers. So let's assume that in a memory cell we store the value 65 (which is the letter 'A'). There is a value stored, and at some point C code wants to read it and store it into a char variable. So far so good. Then we call printf with the %c format specifier for that char parameter.
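In code, I imagine it as something like this minimal example (my own illustration, assuming an ASCII machine):

#include <stdio.h>

int main(void)
{
    char c = 65;       /* the same bit pattern as 'A' on an ASCII system */
    printf("%c\n", c); /* prints: A  - the value shown as a character */
    printf("%d\n", c); /* prints: 65 - the same value shown as a number */
    return 0;
}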

And here is where the magic happens: HOW EXACTLY does printf know that the character with value 65 is the letter 'A' (and should be displayed as a letter)? It is a basic character from the plain ASCII range (not some fancy emoji-style UTF character). Does it call external standard libraries/system calls to consult an encoding table? I would love a nitty-gritty, low-level explanation, or at least a link to a trusted source.

Peter Cordes
Chlebik
  • 1
    In short terms, you need to think about the difference between a character _value_ and its _representation_. Its value is just a number contained in a byte (from 0 to 255 or from -128 to 127 depending on its signedness). Its representation in the printf function depends on the format specifier: with `%c` you tell it to display the character corresponding to the value it contains according to the current character encoding (e.g. ASCII). Using `%d` you would see just the value of the unsigned char you are passing to it. – Roberto Caboni Sep 05 '20 at 10:30

3 Answers

4

The C language is largely agnostic about the actual encoding of characters. It has a source character set which defines how the compiler treats characters in the source code. So, for instance on an old IBM system the source character set might be EBCDIC where 65 does not represent 'A'.

C also has an execution character set which defines the meaning of characters in the running program. This is the one that seems more pertinent to your question. But it doesn't really affect the behavior of I/O functions like printf. Instead it affects the results of ctype.h functions like isalpha and toupper. printf just treats it as a char-sized value, which it receives as an int because variadic functions apply the default argument promotions (any type smaller than int is promoted to int, and float is promoted to double). printf then shuffles the same value off to the stdout file, and then it's somebody else's problem.
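A small sketch of both points (assuming an ASCII-compatible execution character set, so the exact output here is only an illustration):

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    char c = 'A';

    /* The ctype.h functions interpret the value according to the execution
       character set (and the current locale), so they "know" that 65 is a letter. */
    printf("%d\n", isalpha((unsigned char)c) != 0);  /* prints: 1 */
    printf("%c\n", tolower((unsigned char)c));       /* prints: a */

    /* printf itself receives the char as an int because of the default
       argument promotions; %c simply writes the corresponding byte to stdout. */
    printf("%c\n", c);                               /* prints: A */
    return 0;
}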

If the source character set and execution character set are different, then the compiler will perform the appropriate conversion, so the source token 'A' will be manipulated in the running program as the corresponding A from the execution character set. The choice of actual encoding for the two character sets, i.e. whether it's ASCII or EBCDIC or something else, is implementation defined.
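For example, the output of this tiny program is implementation defined: an ASCII-based implementation prints 65, while an EBCDIC-based one would print 193:

#include <stdio.h>

int main(void)
{
    /* The numeric value of the character constant 'A' comes from the
       execution character set, so it depends on the implementation. */
    printf("%d\n", 'A');
    return 0;
}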

With a console application, it is the console or terminal that receives the character value, and it has to look that value up in a font's glyph table to display the correct image of the character.

Character constants are of type int. Except for the fact that it is implementation defined whether char is signed or unsigned, a char can mostly be treated as a narrow integer. The only conversion needed between the two is narrowing or widening (and possibly sign extension).
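A quick sketch of both points:

#include <stdio.h>

int main(void)
{
    /* Character constants have type int, not char. */
    printf("%zu %zu\n", sizeof 'A', sizeof(char));   /* e.g. "4 1" on a typical system */

    /* Converting between char and int is just narrowing/widening. Whether the
       widened value comes back as -1 or 255 depends on whether plain char is
       signed (sign extension happens) or unsigned on the implementation. */
    char c = -1;
    int  i = c;
    printf("%d\n", i);
    return 0;
}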

luser droog
  • The execution character set specifies the mapping from `'A'` to 65 (assuming you meant the character constant). 5.1.1.2.5: "Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set;" – Paul Hankin Sep 05 '20 at 10:12
  • Thanks. I've edited to incorporate that. Am I missing (or misstating) anything else? – luser droog Sep 05 '20 at 10:19
4

"HOW EXACTLY printf knows that character with value 65 is letter 'A' (and should display it as a letter)."

It usually doesn't, and it does not even need to. Even the compiler does not see the characters ', A and ' in this C language fragment:

char a = 'A';
printf("%c", a);

If the source and execution character sets are both ASCII or ASCII-compatible, as is usually the case nowadays, the compiler will see, in its stream of input bytes, the triplet 39, 65, 39 - or rather 00100111 01000001 00100111. Its parser has been programmed with a rule that something between two 00100111 bytes is a character literal, and since 01000001 is not a magic value, it is translated as-is into the final program.
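You can see those same three bytes yourself by dumping the source fragment as data (a sketch, again assuming an ASCII-compatible system):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same bytes the compiler's parser sees for the token 'A'. */
    const char source_fragment[] = "'A'";
    for (size_t i = 0; i < strlen(source_fragment); i++)
        printf("%d ", source_fragment[i]);   /* prints: 39 65 39 */
    printf("\n");
    return 0;
}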

The C program, at runtime, then handles 01000001 all the time (though from time to time it might be 01000001 zero-extended to an int, e.g. 00000000 00000000 00000000 01000001 on 32-bit systems; adding leading zeroes does not change its numerical value). On some systems, printf - or rather the underlying internal file routines - might translate the character value 01000001 to something else. But on most systems, 01000001 will be passed to the operating system as is. Then the operating system - or possibly a GUI program receiving the output from the operating system - will want to display that character, so the display font is consulted for the glyph that corresponds to 01000001, and usually the glyph for 01000001 looks something like

A

And that will be displayed to the user.
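A small sketch of that pass-through behaviour (assuming an ASCII-compatible system, so that 'A' is 65):

#include <stdio.h>

int main(void)
{
    /* All three calls push the same single byte, 01000001 (65), into the
       stdout stream; none of them "knows" it is the letter A. Whatever sits
       on the other end of the stream - a terminal, a file, a pipe - decides
       what to do with that byte. */
    printf("%c", 'A');
    putchar(65);
    fputc(0x41, stdout);
    putchar('\n');
    return 0;
}

Piping the output through something like od -c shows that only raw byte values ever cross the stream.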

At no point does the system really operate with glyphs or characters but just binary numbers. The system in itself is a Chinese room.


The real magic of printf is not how it handles characters, but how it handles numbers, as these are converted into more characters. While %c passes values as-is, %d will convert a simple integer value such as 0b101111000110000101001110 into the stream of bytes 0b00110001 0b00110010 0b00110011 0b00110100 0b00110101 0b00110110 0b00110111 0b00111000 so that the display routine will correctly display it as

12345678
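A rough sketch of what that conversion has to do (this is not how any real printf is implemented, just the idea):

#include <stdio.h>

/* Turn the binary value into a sequence of digit characters, i.e. into
   more bytes, which are then written to the stream one by one. */
static void print_decimal(unsigned int value)
{
    char digits[16];
    int n = 0;

    do {
        digits[n++] = (char)('0' + value % 10);   /* '0'..'9' are bytes 48..57 in ASCII */
        value /= 10;
    } while (value != 0);

    while (n > 0)
        putchar(digits[--n]);                     /* most significant digit first */
}

int main(void)
{
    print_decimal(12345678);   /* writes the bytes 49 50 51 52 53 54 55 56 */
    putchar('\n');
    return 0;
}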

  • So AFAIU it's actually the OS/display manager/application that renders the output (on a pure C, bare-bones Linux machine - the terminal) from the bytes coming out of the compiled C program? In the 'execution context' (whatever that is) the C program does not differentiate whether it's a character or just a number. – Chlebik Sep 05 '20 at 13:03
  • 1
    @Chlebik well, something like that but... you always write **bytes** to stdout, and these bytes, small integers, are called **characters** in C. It is the duty of the receiver, whatever it is, to decode these bytes for display, if necessary, and to convert each byte *or sequence of bytes* to *glyphs* for display. – Antti Haapala -- Слава Україні Sep 05 '20 at 13:24
-1

char in C is just an integer type that is CHAR_BIT bits wide. Usually it is 8 bits.
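For example (a minimal illustration):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* CHAR_BIT is the number of bits in a char: at least 8,
       and exactly 8 on practically every current system. */
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("char range: %d..%d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}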

HOW EXACTLY printf knows that character with value 65 is letter 'A'

The implementation knows what character encoding it uses, and the printf function code takes the appropriate action to output the letter 'A'.

0___________
  • 1
    That is one of many ways `printf` could be implemented, but it is not typically how it is done. `printf` would typically just send a byte with the value `65` to the output stream, without caring about its meaning. If there is a terminal emulator on the other end of the stream (which is often the case), then it is the job of the terminal emulator to turn the value `65` into pixels resembling the letter `'A'` – HAL9000 Sep 05 '20 at 14:47