I've been fooling around a little with C pointers, and I came up with the following example, but I can't explain the behaviour for each case. Here is the code, but I'm not sure exactly why it behaves the way it does.

#include <stdio.h>
#include <stdlib.h>

void return_void(){
  printf("in return_void\n");

  int* p;
  *p = 2;
  printf("assigned a value for the pointer to point to\n");
  printf("Scenario %d\n", *p);
  printf("function ends\n\n\n");
}

int* return_no_malloc(){
  printf("in return_no_malloc\n");
  int* p;
  *p = 3;
  printf("assigned a value for the pointer to point to\n");
  printf("Scenario %d\n", *p);
  printf("function ends \n \n \n ");
  return p;
}

int* return_malloc(){
  printf("in return_malloc\n");
  int* p = (int*)malloc(sizeof(int));
  *p = 4;
  printf("assigned a value for the pointer to point to\n");
  printf("Scenario %d\n", *p);
  printf("function ends \n \n \n ");
  return p;
}


int main(void) {
  
  //scenario 1: locally declare a pointer and print what it points to 
  // this works (should it?)
  printf("in main\n");
  int* p;
  *p = 1;
  printf("assigned a value for the pointer to point to\n");
  printf("Scenario %d\n", *p);
  printf("\n\n\n");
  //////////////////////////////////////////////////////////////////////////


  //scenario 2: do what you did in scenario 1, but in a helper function
  //return_void(); //causes a seg fault only after exiting the main (why?)
  ///////////////////////////////////////////////////////////////////////////
  
  
  //scenario 3: call a function that returns a pointer to an int that you don't malloc
  //without storing the result of the call
  int* result = return_no_malloc(); //segmentation fault only after exiting main
  ///////////////////////////////////////////////////////////

  //scenario 4: use malloc to make space for what the pointer will point to 
  int* q = return_malloc(); //works fine

  printf("main is done \n");
  //if scenario 2 and 3 are not commented out, the seg fault occurs here
  return 0;
}

In scenario 1, we declare a pointer to an int, give it the value 1, and print. This works. I'm a little confused as to why it works, since we never allocated space for the actual int that p is supposed to point to.

In scenario 2, we get a segmentation fault after we finish all the instructions in main. Same story for scenario 3. Why in these two scenarios does everything compile and run fine (Still don't know why this is, from scenario 1), but ONLY UNTIL we have no more instructions in main??

I know that scenario 4 should work, and it does.

Can someone explain what's going on here? My understanding is that if you want a pointer to point to something, you need to reserve a slot in memory for it.

P. Gillich
  • 3
    Undefined behavior does not always mean a segfault. It can be corrupting random memory that you're not using right now. – Bill Lynch Mar 16 '21 at 01:07
  • @BillLynch could you please explain where the undefined behaviour is? – P. Gillich Mar 16 '21 at 01:10
  • 1
    Pointers must be initialized with a valid address before they are dereferenced. Only scenario 4 follows that rule. – user3386109 Mar 16 '21 at 01:10
  • @user3386109 Then I don't understand why the above code compiled and ran in scenario 1, can you elaborate a little more? – P. Gillich Mar 16 '21 at 01:47
  • 2
    C has rules. The rules aren't for the compiler to enforce. The rules aren't for the runtime to enforce. **The rules are for you to follow.** If you don't follow the rules, your code is garbage, and its behavior is undefined. Possible behaviors are segfaults, erratic outputs, or acting like nothing is wrong. – user3386109 Mar 16 '21 at 02:12
  • See the top answer here: https://stackoverflow.com/questions/2397984/undefined-unspecified-and-implementation-defined-behavior – M.M Mar 16 '21 at 02:39

1 Answer

TL;DR

Cases 1, 2 and 3 are all the same thing: writing through an uninitialized pointer, which is undefined behavior, as the Standard would put it. In practice it often ends up "corrupting the stack". The problem is that an uninitialized variable holds whatever happened to be in its memory, so when your program starts executing you do not know what value it contains. Since a pointer holds an address (that's technically incorrect if you're pedantic about the standard, but let's make things simple here), it does point somewhere, and anything can happen when you dereference it. So don't do it.

Case 4 is when you get more memory from the heap (read the links posted in the comments). malloc returns a memory address that you store in your pointer, typically an address in the heap (which traditionally grows through calls to the sbrk(2) system call).


The Fun Part

DISCLAIMER: nothing below is a derivation of the standard, nor tries to be. On the contrary. Everything here is the product of tinkering and reading about computer architecture and internals (both books and source code). But mostly tinkering - there's no better way to learn the gritty details than tinkering.

Now, of course there's undefined behavior. But computers are crazy things, programming is even crazier, so surely there's something down the rabbit hole - as you've guessed. There's the Standard and there's undefined behavior. And there's the real world, where variables live in memory, there are these things called stack-frames, instruction pointers, curiosity and tinkering.

There's also a lot of dependency on the implementation of the hardware, operating system, compiler, assembler, linker, loader and a lot of other things. With all that said, let's go back in time to the 60's or 70's, when computer architecture was simple to understand and there were no people that liked standards.

In simple terms, when a program starts, an instruction at a specific location is executed. Then, some code put there by the compiler/assembler/linker will set up the running "environment" and call your main function. But before it does, it saves the contents of its registers in memory, on the stack (this is the program's context, so to speak) - the saving is done by pushing the contents of the registers onto the stack. Very important to this is the saved instruction pointer, which holds the address of the next instruction to be executed when the main function returns.

When all of main's instructions are executed and you reach that return, the computer pops the saved contents from memory, in reverse order, back into the registers.

So, imagine this:

push rax ; general purpose register
push rbx ; general purpose register
push rcx ; general purpose register
push rip ; instruction pointer (our return address)

And in reverse, to restore the values:

pop rip
pop rcx
pop rbx
pop rax

The thing is, if we were to "corrupt" the stack, then the value popped into rip could be anything but the original, correct value (the one that gives the "normal execution flow").

Right after setting up the stack frame, C compilers generally make room for local variables. Therefore, all the variables you declare generally sit close to the saved values on the stack, which means you could overwrite those values if you wrote more bytes than you should. Keep in mind that stacks generally start at high addresses and grow towards lower addresses.

So, when you do something like this:

int main(void)
{
    char a, b, c, d;
    return 0;
}

And compile without any optimizations, the stack should look like this:

84   83   82   81   80   7f   7e   7d   7c   7b   7a   79   78
+----+----+----+----+----+----+----+----+----+----+----+----+
|        RIP        |  a |  b |  c |  d |                   |
+----+----+----+----+----+----+----+----+----+----+----+----+
                    |
           <--------+ we write FROM here to THERE, in that direction

This means that, if RIP started at address 0x80 and is 4 bytes, and a, b, c, d were 1 byte each, their addresses would be 0x7F, 0x7E, 0x7D, 0x7C, respectively (very implementation dependent). So, to write a value to the d variable, you would write 1 byte starting at address 0x7C (this would fill 8 bits worth of memory towards the address 0x7D).

Now, if you wrote TWO bytes, starting at position 0x7C, you would write at positions 0x7C and 0x7D. Therefore, you would write over d and c variables. IFF your compiler/arch behaves like mine, try this:

/* smash.c */
#include <stdio.h>

int main(int argc, char *argv[])
{
    char a, b, c, d;
    int *p;

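    /* Deliberately wrong: d is 1 byte, but we write sizeof(int) bytes
     * through p. The cast is needed because newer compilers reject the
     * char* to int* conversion. This is undefined behavior. */
    p = (int *)&d;
    *p = 0x41424344;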

    printf("a: %c, b: %c, c: %c, d: %c\n", a, b, c, d);

    return 0;
}

Compile, run, expect:

$ gcc -o smash smash.c
$ ./smash
a: A, b: B, c: C, d: D
$

Now you know how overwriting things on the stack works. Now, when you declare a pointer, the compiler will allocate space for it to hold one "address" (there's no such thing as a memory address in the Standard, as I recall, because pointers there are more abstract). However, I like memory and addresses, so in simple 60's terms:

A pointer holds a memory address. When you call main, this is what the stack looks like (very simplified; the last row shows the contents of the memory addresses):

84   83   82   81   80   7f   7e   7d   7c   7b   7a   79   78 (hex address)
+----+----+----+----+----+----+----+----+----+----+----+----+
|        RIP        |  a |  b |  c |  d |         p         |  (var names)
+----+----+----+----+----+----+----+----+----+----+----+----+
|      0x90         |  0 |  0 |  0 |  0 |      0x80         |  (contents)
+----+----+----+----+----+----+----+----+----+----+----+----+

If you try to dereference pointer p with *p, what the computer understands is:

  1. Get the value of p (which is 0x80)
  2. Dereference that (in other words, access the location 0x80 and retrieve its contents)
  3. What is in address 0x80? The value of the RIP register, or 0x90.

And if you try something like *p = 1...

  1. Get the value of p (which is 0x80)
  2. At that location (0x80), write the value 1.

Then the stack would be:

84   83   82   81   80   7f   7e   7d   7c   7b   7a   79   78 (hex address)
+----+----+----+----+----+----+----+----+----+----+----+----+
|        RIP        |  a |  b |  c |  d |         p         |  (var names)
+----+----+----+----+----+----+----+----+----+----+----+----+
|         1         |  0 |  0 |  0 |  0 |      0x80         |  (contents)
+----+----+----+----+----+----+----+----+----+----+----+----+

And nothing happens right away, because there is memory available to store that number 1 of yours. However, you stored it in the place of the saved instruction pointer, which will be used when main returns. When that happens, the value 0x01 will be loaded into the program counter, and the CPU will try to execute an instruction at that address. Now, what are the contents of memory address 0x01? I don't know, you don't know, the Standard people certainly do NOT know - this is undefined behavior, because we don't know what will happen.


Further reading

If you want to learn more about low level programming:
  • Structured Computer Organization: Great book that has this and more.
  • Low level programming: lots of code.
  • aleph1 article. Read it and try all the code.
  • You will not blow your CPU like many people tell you. Write programs that have pointers. Do crazy things with them. Write down what you see.
  • gcc -S smash.c will write the assembly output of your program to smash.s. Very enlightening.
  • GNU Debugger - gdb(1): this is THE tool for low-level tinkering.
Enzo Ferber
  • 1
    Wow. This is an incredible answer. Thank you so, so much!!! I think I now see the error of my ways--moral of the story, make sure a pointer points to a valid address (eg of an `int` I already declared). – P. Gillich Mar 16 '21 at 02:34