Note: I know that reading an uninitialized string is undefined behaviour. This question is specifically about the GCC implementation.

I am using GCC version 6.2.1 and I have observed that uninitialized strings of length greater than 100 or so are initialized to "". Reading an uninitialized string is undefined behaviour, so the compiler is free to set it to "" if it wants to, and it seems that GCC is doing this when the string is long enough. Of course I would never rely on this behaviour in production code - I am just curious about where this behaviour comes from in GCC. If it's not in the GCC code somewhere then it's a very strange coincidence that it keeps happening.

If I write the following program

/* string_initialization.c */
#include <stdio.h>

int main()
{
  char short_string[10];
  char long_string[100];
  char long_long_string[1000];

  printf("%s\n", short_string);
  printf("%s\n", long_string);
  printf("%s\n", long_long_string);

  return(0);
}

and compile and run it with GCC, I get:

$ ./string_initialization
�QE�


$

(sometimes the first string is empty as well). This suggests that if a string is long enough, then GCC will initialize it to "", but otherwise it will not always do so.

If I compile the following program with GCC and run it:

#include <stdio.h>

int main()
{
  char long_string[100];
  int i;

  for (i = 0 ; i < 100 ; ++i)
  {
    printf("%d ", long_string[i]);
  }
  printf("\n");

  return(0);
}

then I get

0 0 0 0 0 0 0 0 -1 -75 -16 0 0 0 0 0 -62 0 0 0 0 0 0 0 15 84 -42 -17 -4 127 0 0 14 84 -42 -17 -4 127 0 0 69 109 79 -50 46 127 0 0 1 0 0 0 0 0 0 0 -35 5 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -112 5 64 0 0 0 0 0 80 4 64 0 0 0 0 0 16 85 -42 -17 

so just the start of the string is being initialized to 0, not the whole thing.

I'd like to look into the GCC source code to see what the policy is, but I don't know that code base well enough to know where to look.

Background: My CS student turned in some work in which they declared a string to have length 1000 because "otherwise strange symbols are printed". You can probably guess why. I want to be able to give them a good answer as to why this was going on and why their "fix" worked.

Update: Thanks to those of you who gave useful answers. I've just found out that my computer prints out an empty string if the string is of length 1000, but garbage if the string is of length 960. See pts's answer for a good explanation. Of course, all this is completely system-dependent and is not part of GCC.

aaaaaa123456789
John Gowers
  • 6
    They are never initialized to anything. Whatever happens to be there is there. One must *never* assume anything about what is there and reading from allocated memory that hasn't been initialized is undefined. – Sami Kuhmonen Nov 10 '16 at 16:53
  • 2
    You're just reading random garbage off the stack and imagining there is some pattern to it. There isn't, it's garbage. – Jonathan Wakely Nov 10 '16 at 16:56
  • 1
    @SamiKuhmonen ...Yes, I know that. However, **when I use GCC**, I have observed that the longer strings *are* being initialized to empty strings. This question is specifically about the GCC implementation. – John Gowers Nov 10 '16 at 16:56
  • 5
    GCC implementation is not to initialize them to anything, since that's not the job of the compiler. They are uninitialized. Undefined. That's the end of it. – Sami Kuhmonen Nov 10 '16 at 16:57
  • 3
    The 'fix' didn't actually work - it was just pure luck (and may not work on any subsequent run or another machine). Also the `for` loop you use is suspicious, you are printing `char`s using `%d` which means it interprets them as signed integers (which are usually larger than chars) – UnholySheep Nov 10 '16 at 16:57
  • 5
    This is how to fix it and get rid of the undefined behavior, making it work for short and long strings alike: `char short_string[10] = { 0 }; char long_string[100] = { 0 }; char long_long_string[1000] = { 0 };` – pts Nov 10 '16 at 16:57
  • @UnholySheep If you have a copy of gcc (I used version 6.2.1) then you can try it yourself. – John Gowers Nov 10 '16 at 16:58
  • Possible dup: http://stackoverflow.com/questions/1597405/what-happens-to-a-declared-uninitialized-variable-in-c-does-it-have-a-value – P.P Nov 10 '16 at 17:00
  • 1
    @Donkey_2009 I don't see what exactly I'm supposed to see in it? You are printing random garbage and trying to determine the inner workings of the compiler this way? using `printf("%d")` with `char`s is not showing anything useful. If you want to know what GCC does, look it up? It's open source. – UnholySheep Nov 10 '16 at 17:00
  • @UnholySheep What is your output when you compile and run my code using GCC? Do you get garbage out from the two longer strings? If so, that is strange, since I always get the empty string. For the first string, of length 10, I do indeed get garbage. – John Gowers Nov 10 '16 at 17:02
  • @P.P. No, that is a completely different question. – John Gowers Nov 10 '16 at 17:03
  • 1
    @Donkey_2009 I get [random values](http://ideone.com/hPGheo). Including a 1 for the very first "number" – UnholySheep Nov 10 '16 at 17:04
  • 2
    Out of interest what happens if you declare the long array first? – user2697817 Nov 10 '16 at 17:05
  • @user2697817 If I move the long array to the beginning, all that happens is that the 'garbage' line becomes the second line. The two long strings still appear to be zero. – John Gowers Nov 10 '16 at 17:09
  • That is strange then. I do see your point. Maybe they are still allocated in the same order. – user2697817 Nov 10 '16 at 17:10
  • 2
    @SamiKuhmonen 'Undefined behaviour' does not mean that the compiler is required to print out different things on each run. It means precisely that the compiler can interpret that part of the program however it wants. If it wants to initialize all strings to "", then it's still standard-compliant. As it turns out, GCC does not behave in this way, but I think it's not unreasonable to suppose that a compiler might do so. Of course, I would never rely on that behaviour. – John Gowers Nov 10 '16 at 17:44
  • 1
    @pts My preference is for `char short_string[10] = "";`, since then it's more obvious that it's an empty string. I'd also prefer `'\0'` over `0`, to make it explicit that it's a null termination character. But this question wasn't about good coding practices - it was more a "why does this undefined behaviour turn out this way on this particular setup" kind of question. Initializing the strings to be empty strings would defeat the point of the question. – John Gowers Nov 10 '16 at 18:18

3 Answers

As others have commented before, reading uninitialized data (e.g. elements of short_string) is undefined behavior according to the C standard.

If you are interested in what actually happens when you compile it with GCC and run it on Linux, here are some insights.

main is not the first function which gets run when your program starts. The entry point is usually called _start, and it calls main. What is on the stack in these uninitialized arrays when main is running depends on what has been put there before, i.e. what _start has done before calling main. What _start does depends on GCC and the libc.

To figure out what actually happens, you may want to compile your program with gcc -static -g, and run it in a debugger, something like this:

$ gcc -static -g -o myprog myprog.c
$ gdb ./myprog
(gdb) b _start
(gdb) run
(gdb) s

Instead of s you may want to issue other GDB commands to get the disassembly of _start, and run it instruction-by-instruction.

One possible explanation for why your program reads more 0s from an uninitialized long array than from an uninitialized short one is that the stack was (mostly) all 0s at the beginning, before _start started running. _start then overwrote some bytes of the stack, but the beginning of the long array lies in a part of the stack that _start has not overwritten, so it is still all 0s. Use a debugger to confirm.

You may also be interested in reading data from uninitialized global arrays. These arrays are guaranteed to be initialized to 0 by the C standard, and GCC implements this by putting them into the .bss section. See the question "how about .bss section not zero initialized" for how .bss is initialized.

pts
  • Your possible explanation makes a lot of sense to me. When I run the program to print out the 100 uninitialized bytes as numerical values, the later values change, but the first few stay the same. – John Gowers Nov 10 '16 at 17:38

GCC doesn't initialize those strings, at all, ever. You are just seeing that the stack happens to contain zeros and you are imagining that is some intentional behaviour of the compiler. It isn't.

Compare your results with http://coliru.stacked-crooked.com/a/38f3e70be871af61 which shows that even if the first few bytes of the array happen to be zero the first time the function is called, the bytes are not zero the second time (because I made the stack dirty, and the compiler doesn't initialize the array).

You cannot assume that some undefined behaviour is reliable, repeatable or intentional. That's a very dangerous assumption.

Jonathan Wakely
  • 1
    OK, thanks. I suppose in that case my question is: why is this happening more often when the strings are of length 100 than it is when they are of length 10? – John Gowers Nov 10 '16 at 17:06
  • 1
    @Donkey_2009 Probably because it's less likely that the extra stack memory has already been used for something else. – Barmar Nov 10 '16 at 17:11
  • @P.P. Please give me an example of such a comment. – John Gowers Nov 10 '16 at 17:27
  • @P.P. The value of an uninitialized variable is undefined by the standard, but that does *not* mean that a compiler is forbidden to fill it with the value 0 if it wants to. The compiler is also not forbidden to run `rm -rf /` if you read in the value of an uninitialized variable, but that's beside the point. At no point in my comments did I ever suggest that this was universal behaviour in C, and I made it very clear that my question was about a *particular implementation* of the C standards. As it turns out, the answer also depends on my system. Why do you think I don't understand UB? – John Gowers Nov 10 '16 at 17:55
  • 1
    Just because you observe some specific behaviour doesn't mean it is an intentional property of the particular implementation. GCC is not initializing your strings. – Jonathan Wakely Nov 10 '16 at 18:40
  • @JonathanWakely Yes thanks, I realize that now. I'm still not sure why PP thinks I don't understand undefined behaviour, though. – John Gowers Nov 10 '16 at 21:04

The answer is simple: this happens because of undefined behavior caused by reading the values of uninitialized variables.

Giorgi Moniava
  • 1
    Yes, I know that it's undefined behaviour. My question is specifically about the GCC implementation. – John Gowers Nov 10 '16 at 16:55
  • 1
    It probably isn't initializing them at all. You're getting whatever happens to be lying around in memory, and quite often that's zeroes. – Fred Larson Nov 10 '16 at 17:03
  • @FredLarson Yes, but quite often it's not zeroes. When I try to print out an uninitialized string of length 10, I often get random characters printed out. When the string is longer, overwhelmingly the most likely result is that the first character is a zero. – John Gowers Nov 10 '16 at 17:12
  • 2
    @Donkey_2009: Because size 10 is small enough that it's using a part of the stack that's very likely to already have been used. Larger sizes are progressively less likely to have been used. – R.. GitHub STOP HELPING ICE Nov 10 '16 at 17:14