
Messing around a bit with C pointers, I came across some rather strange behavior.
Consider the following code:

#include <stdio.h>

int 
main ()
{
   char charac = 'r';

   long long ptr = (long long) &charac;  // Stores the address of charac into a long long variable

   printf ("[ptr] points to %p containing the char %c\n", ptr, *(char*)ptr);

}

On 64-bit architectures

When compiled for a 64-bit target architecture (compilation command: gcc -Wall -Wextra -std=c11 -pedantic test.c -o test), everything seems fine and the execution gives

> ./test 
[ptr] points to 0x7fff3090ee47 containing the char r

On 32-bit architectures

But if the compilation targets a 32-bit architecture (compilation command: gcc -Wall -Wextra -std=c11 -pedantic -ggdb -m32 test.c -o test), the execution gives this weird result:

> ./test     
[ptr] points to 0xff82d4f7 containing the char �

The weirdest part is that if I change the printf call in the previous code to printf ("[ptr] contains the char %c\n", *(char*)ptr);, the execution gives a correct result:

> ./test     
[ptr] contains the char r

The issue seems to arise only on 32-bit architectures, and I can't figure out why changing the printf call causes the execution to behave differently.

PS: It may be worth mentioning that the underlying machine is an x86 64-bit architecture, running the program in the 32-bit compatibility mode selected by the -m32 option in gcc.

programmersn
  • 4
    What is the reason you want to use a `long long` type to store a pointer? The correct way to have a pointer to `char` would be `char *`. – Some programmer dude Aug 22 '18 at 09:24
  • @Someprogrammerdude Yeah you're right, I am aware that the right way of doing it is as you said. but as I mentioned in the post, I'm purposely messing around with pointers in order to fully master the C memory model and when undefined behavior is triggered. So it's more of an academic question than of a practical one ;-) – programmersn Aug 22 '18 at 09:27
  • 3
    if you want to convert a pointer to an integer type you must use `(u)intptr_t`. And you're getting UB since you're printing `ptr` with `%p` – phuclv Aug 22 '18 at 09:28
  • 2
    Using your compiler options, you should get a whole list of warnings here. The code has undefined behavior. – Lundin Aug 22 '18 at 09:28
  • 2
    Fair enough, but then remember that one of the possibilities of UB is that it might seemingly work, making it very hard to detect. – Some programmer dude Aug 22 '18 at 09:29
  • 1
    Possible duplicate of [What happens when I use the wrong format specifier?](https://stackoverflow.com/questions/16864552/what-happens-when-i-use-the-wrong-format-specifier) – phuclv Aug 22 '18 at 09:29
  • 2
    *to fully master the C memory model and when undefined behavior is triggered.* There's nothing to master with undefined behavior. Compilers can assume undefined behavior never happens, so when you do trigger it there's no way to tell what might happen. – Andrew Henle Aug 22 '18 at 09:30
  • @phuclv I don't think the issue is related to the format specifier. Maybe it has more to do with the long long variable being 8 bytes long and receiving a 4-byte address? – programmersn Aug 22 '18 at 09:32
  • 4
    @programmersn to print a `long long` you have to use `%lld`. Since you're using a wrong one it invokes UB and in this case it messes up the stack layout and results in the segfault – phuclv Aug 22 '18 at 09:36
  • @phuclv You were damn right, changing to %llx the format specifier made the issue disappear ! Could you please rephrase your solution in an answer and also, if you can, expand a little bit about the stack layout ? So I can mark your solution as the right one ;-) – programmersn Aug 22 '18 at 09:43
  • 2
    @programmersn Curious you choose to try `"%llx"` with `long long` rather than the [recommended `%lld`](https://stackoverflow.com/questions/51963609/using-long-long-integer-to-store-32-bit-pointer-causes-printf-to-bug#comment90877804_51963609). When a `long long` is not in `unsigned long long` range (negative values), using `"%llx"` is UB. – chux - Reinstate Monica Aug 22 '18 at 11:45
  • @chux You must be right. I chose that format in order to print the address in hexadecimal. How would you print out a signed integer with hexadecimal notation ? – programmersn Aug 22 '18 at 12:43
  • 1
    @programmersn To print a signed integer in hexadecimal notation, simply cast it to the corresponding unsigned type and then use the matching `"%x"` or `"%X"`. Alternatively, with such issues code could change `ptr` to an unsigned type like `uintptr_t` or `uintmax_t` instead. – chux - Reinstate Monica Aug 22 '18 at 13:06

2 Answers


You are basically cheating your compiler.

You tell printf that you pass a pointer as first parameter after the format string. But instead you pass an integer variable.

While this is always undefined behaviour, it may somehow work as long as the sizes of the expected type and the passed type are the same. That's the "undefined" in "undefined behaviour". It is also not defined to crash or immediately show bad results. It may just pretend to work while waiting to hit you from behind.

If your long long has 64 bits while a pointer only has 32 bits, the arguments on the stack no longer line up with what printf expects, causing printf to read from the wrong locations.

Depending on your architecture and tools, you have good chances that your stack looks like this when you call a function with variadic parameter list:

+---------------+---------------+---------------+
| last fixed par| Par 1   type1 | Par 2   type2 |
|    x bytes    |    x bytes    |    x bytes    | 
+---------------+---------------+---------------+

The unknown parameters are pushed on the stack and finally the last known parameter from the signature is pushed. (Other known parameters are ignored here)

Then the function can walk through the parameter list using va_arg and friends. For this purpose the function must know which types of parameters are passed. The printf function uses the format specifier to decide which parameter to consume from the stack.

Now it comes to the point where everything depends on you telling the truth.

What you tell your compiler:

+---------------+---------------+---------------+
| format  char* | Par 1   void* | Par 2     int |
|    4 bytes    |    4 bytes    |    4 bytes    | 
+---------------+---------------+---------------+

For the first parameter (%p), printf takes 4 bytes, which is the size of a void*. Then it takes another 4 bytes (the size of an int) for parameter 2 (%c).

(Note: The last parameter is printed as a character, i.e. only 1 byte will be used in the end. Because of the default argument promotions, which apply to arguments passed through a variadic parameter list, a char is converted to int before the call and is stored as an int on the stack. Hence printf must also consume the bytes of an int in this case.)

Now let's look at your function call (What you really put into printf):

+---------------+-------------------------------+---------------+
| format  char* |   Par 1           long long   | Par 2     int |
|    4 bytes    |            8 bytes            |    4 bytes    | 
+---------------+-------------------------------+---------------+

You still claim to provide a pointer and an integer parameter of 4 bytes each. But now the first parameter comes with an extra 4 bytes of length, which remain unknown to the printf function. As you have told it, the function reads 4 bytes for the pointer. These may coincide with the first 4 bytes of the long long, but the remaining 4 bytes are not consumed. Then the next 4 bytes, used for the %c format, are read, but they are still the second half of your long long. Whatever this may be, it is not what you want. Finally, the integer you pushed is still untouched when the function returns.

That's the reason why you should not mess with weird type casting and wrong types.

And that's also the reason why you should look at your warnings during compiling.

Gerhardh
  • Could you expand a little more on *stack layout broken* ? What do you mean by that ? – programmersn Aug 22 '18 at 16:54
  • 1
    @programmersn: the same thing I meant by getting the calling convention wrong in comments on the other answer. [Function called in a file without a prototype produce different results on ARM and x86-64](https://stackoverflow.com/a/47170047) is another example of the same problem. (Nice update, \@Gerhardh. First version was a little vague, and didn't go into as much detail about exactly what does happen in this case for this calling convention.) – Peter Cordes Aug 22 '18 at 21:41
  • @gerhardh The update cleared away most of the difficulties. Wonderful explanation ! I just have one last question: why is the `%c` format specifier translated into a 4-bytes memory slot by `printf` ? Was a `char` not meant to take up 1 byte in memory ? I've always thought that only *wide chars* were expanding on 4 bytes in order to represent UTF-32 characters, and in any way the length modifier `l` could be used in `%lc` in order to represent a 4-bytes character. – programmersn Aug 23 '18 at 21:20
  • 2
    If you have a function that does not specify the types of parameters, any integer like variable is promoted to `int` before calling the function. This means for a `char` or `short` there are the same number of bytes used as for `int`. As a result the function must fetch the exact same number when it handles that parameter. Of course after fetching the parameter from the stack, only 1 byte is used for printing when `%c` is used. – Gerhardh Aug 24 '18 at 08:36
  • 1
    @programmersn: Even if you *do* have a prototype like `void foo(char c, short s, int i)`, the minimum width of an arg-passing slot is 4 or 8 bytes in normal 32 or 64-bit calling conventions. Narrow args leave the upper bytes unused. Look at compiler-generated asm for a caller of such a function: https://godbolt.org/z/d7rLcU. it just uses `push` (32-bit stack-args calling convention), or puts args in registers, with the whole register dedicated to one arg (instead of packing multiple narrow args into the first 64-bit register). – Peter Cordes Aug 24 '18 at 14:47
  • 1
    @PeterCordes In practice probably yes. But the C standard would not mandate it if a prototype is present. An implementation would be allowed to use less than size of an integer. Because then the type is `char` and you put that type in your considerations wrt. alignment and calling conventions. But without a prototype it is converted to `int` before any alignment and calling conventions are applied. It might be the same result or not. – Gerhardh Aug 24 '18 at 15:25
  • 1
    The C standard doesn't mandate anything: any way your program could detect / depend on calling convention details is undefined behaviour. So yes, a calling convention that packs args tightly is of course possible. I just thought it was interesting/useful to point out the "stack slot" way of thinking about arg-passing, in addition to your correct comment that the default promotions to `int` or `double` mean that `%c` really accepts an `int` which is then converted to `unsigned char`. The default-promotion rules for unprototyped might have been chosen because of PDP-11 calling conventions. – Peter Cordes Aug 24 '18 at 15:40

One big issue: you are using the wrong type for integer/pointer shenanigans. The type intptr_t (declared in <stdint.h>) is a signed integer type that can hold a pointer converted from void *.

So, what goes wrong on the 32-bit architecture?

The type long long int is (with gcc) a 64-bit type. However, printf with the %p format expects to receive a pointer, which is only 32 bits wide in a 32-bit build, not a 64-bit value.

The call to printf will have on the call stack: (illustrative purposes only, details may differ)

pointer to format string
ptr (8 bytes)
*(char *)ptr (at least 1 byte, likely 4)

printf reads the format string and discovers that it should receive a 32-bit pointer and a char. It then reads the first 4 bytes of ptr as the pointer to print and the next 1-4 bytes, which are still part of ptr, as the character to print. It never even knows that there was more data on the stack, namely the actual character it should have printed.

phuclv
Niko Kiirala
  • 2
    The same logic also applies for 32-bit x86 calling conventions that use register args, like Windows `__fastcall` or `__vectorcall`. Or ARM / MIPS / etc. – Peter Cordes Aug 22 '18 at 10:34
  • 1
    And yes, `*(char*)ptr` is 1 significant byte padded to a 4-byte stack slot (or register) in any normal calling convention. The upper bytes are potentially garbage. Most 32-bit x86 calling conventions use at most two registers for arg-passing, so yes, the char will be on the stack there. (see https://stackoverflow.com/tags/x86/info for ABI docs). But on 32-bit ARM it'll be in `r3`. – Peter Cordes Aug 22 '18 at 10:37
  • @Petercordes I still do not understand why does it print a good result when using `%llx` (8 bytes) format specifier but yields a wrong result when using `%p` (4 bytes) specifier. I mean `*(char*)ptr` should interpret as a 1-byte memory location to lookup at, in any ways. Right ? – programmersn Aug 22 '18 at 11:16
  • 1
    @programmersn: because you're violating the calling convention. The `%p` conversion only consumes the first 4 bytes of the 8+1 bytes you passed, so the `%c` conversion is taking the low byte of high half of the `long long` you passed (because x86 is little-endian). Compare the asm output for the two versions of your code and look at where stuff ends up on the stack. The one passing a 4-byte arg is the one that matches where `printf` with `%p` will look for args. – Peter Cordes Aug 22 '18 at 11:25
  • 1
    @programmersn: See also [Function called in a file without a prototype produce different results on ARM and x86-64](https://stackoverflow.com/a/47170047) for a more detailed explanation of this. (The question is comparing x86-64 vs. ARM, but it would be the same for 32-bit x86 because they're doing the same thing: passing a 64-bit arg to a function that's expecting a 32-bit arg followed by another arg.) – Peter Cordes Aug 22 '18 at 11:30
  • "The type intptr_t is an integer type that can store a pointer." is correct when the _optional types_ `(u)intptr_t` exists - very common. Yet now how to print that? `printf ("%jx\n", (uintmax_t) some_uintptr_t_object);` commonly works - perhaps guaranteed to work. – chux - Reinstate Monica Aug 22 '18 at 11:51
  • 1
    @programmersn Even with the same byte size, mismatched print specifiers/arguments can still cause problems. Different types of arguments to a `...` function like `printf()` need not be passed to the function in a like manner, on the stack or wherever. This mismatch is common with `"%u"` and `float`. Moral of the story: enable all compiler warnings and use matching specifiers/arguments. – chux - Reinstate Monica Aug 22 '18 at 11:56
  • 1
    @chux: Probably your best bet for printing a `uintptr_t` is `%p` with `(void*)p`, if you're happy with pointer formatting. Oh, [cppreference says that `PRIdPTR` from `<inttypes.h>`](https://en.cppreference.com/w/c/types/integer) is a macro for the equivalent of `%d` for signed `intptr_t`. So problem solved for C99. Upcasting to `uintmax_t` will increase code-size when it's wider than a pointer, especially wider than an arg-passing slot. (The x32 ABI for x86-64 (32-bit pointers), has 8-byte arg-passing slots.) – Peter Cordes Aug 22 '18 at 21:39