45

A long time ago I used to program in C for school. I remember something that I really hated about C: unassigned pointers do not point to NULL.

I asked many people, including teachers, why in the world they would make the default behavior of an unassigned pointer something other than NULL, since it seems far more dangerous for it to be unpredictable.

The answer was supposedly performance, but I never bought that. I think many, many bugs in the history of programming could have been avoided had C defaulted to NULL.

Here is some C code to point out (pun intended) what I am talking about:

#include <stdio.h>

int main(void) {

  int * randomA;   /* deliberately left uninitialized */
  int * randomB;   /* deliberately left uninitialized */
  int * nullA = NULL;
  int * nullB = NULL;

  printf("randomA: %p, randomB: %p, nullA: %p, nullB: %p\n\n",
     (void *) randomA, (void *) randomB, (void *) nullA, (void *) nullB);

  return 0;
}

This compiles with warnings (it's nice to see that C compilers are much nicer than when I was in school) and outputs:

randomA: 0xb779eff4, randomB: 0x804844b, nullA: (nil), nullB: (nil)

Adam Gent
  • What an interesting question :) – invert Jun 23 '10 at 13:29
  • C says: Trust the programmer. Programmers learn to track their variables. – u0b34a0f6ae Jun 23 '10 at 13:44
  • I think you're confusing C with a high-level language. It isn't. – Skilldrick Jun 23 '10 at 13:54
  • @Adam Gent: this distinction only matters if your code accesses the values of uninitialized variables, which it should never do. Most modern C compilers will complain about that, for good reason. – eemz Jun 23 '10 at 13:54
  • And here I thought all of the good pointer questions had been asked. +1 :) – Tim Post Jun 23 '10 at 13:56
  • I understand C's choices for performance now. However, IMHO, C should not be used for mission-critical embedded devices where safety is important. – Adam Gent Jun 23 '10 at 14:40
  • The correct printf format specifier for a pointer is `%p`, not `%d`. – tomlogic Jun 23 '10 at 14:51
  • The default behavior is for the programmer to initialize the variable when and how they see fit. There's no point in having the function initialize your pointers to NULL, only to have you assign some other value to them at the start of the function. – tomlogic Jun 23 '10 at 14:57
  • @tomlogic I adjusted it to use `%p`. Yes, I can see why that is, but I am used to defining my variables when I need them, and I try to set them so that I don't reset them (think the "final" keyword in Java). – Adam Gent Jun 23 '10 at 15:55
  • @Adam Gent: Well, there's no reason why you can't do that in C (at least, the modern C standard) too. Instead of `int a;`, then sometime later `a = x * 42;`, just put `const int a = x * 42;`. C99 lets you mix declarations and code, so you can put that declaration right before `a` is first needed. – caf Jun 23 '10 at 22:46
  • I find myself slightly baffled at the degree to which this has been voted up: the answer is *absolutely obvious* to anyone who understands C, because a pointer is just like any other variable... – dmckee --- ex-moderator kitten Jun 24 '10 at 20:20
  • @dmckee so you found the correct answer about static pointers obvious? You must also find it obvious that most C compilers have been optimized for many things, like tail recursion, and can check for buffer-overrun errors. Why is it not a good question to ask why C compilers have not been optimized or improved for uninitialized pointers? – Adam Gent Jun 24 '10 at 22:18
  • @Adam: It's not that it is not a good question; it's just that because a pointer is a variable *just like* any other variable, its initialization behavior *must* have the same semantics. That's not negotiable. So, yes, the behavior of static pointers is obvious. – dmckee --- ex-moderator kitten Jun 25 '10 at 01:25

11 Answers

41

Actually, it depends on the storage duration of the pointer. Pointers with static storage duration are initialized to null pointers. Pointers with automatic storage duration are not initialized. See ISO C99 §6.7.8 ¶10:

If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then:

  • if it has pointer type, it is initialized to a null pointer;
  • if it has arithmetic type, it is initialized to (positive or unsigned) zero;
  • if it is an aggregate, every member is initialized (recursively) according to these rules;
  • if it is a union, the first named member is initialized (recursively) according to these rules.

And yes, objects with automatic storage duration are not initialized for performance reasons. Just imagine initializing a 4K array on every call to a logging function (something I saw on a project I worked on; thankfully C let me avoid the initialization, resulting in a nice performance boost).
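
To make the distinction concrete, here is a minimal sketch (the variable names are my own) contrasting the two storage durations:

#include <stdio.h>

int *file_scope_ptr;                /* static storage duration: starts as NULL */

int main(void)
{
    static int *func_static_ptr;    /* static storage duration: starts as NULL */
    int *auto_ptr;                  /* automatic storage duration: indeterminate */

    printf("file_scope_ptr:  %p\n", (void *) file_scope_ptr);
    printf("func_static_ptr: %p\n", (void *) func_static_ptr);

    /* auto_ptr is deliberately never read: doing so would be undefined
       behavior, and most compilers would warn about it. */
    return 0;
}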

Georg Fritzsche
ninjalj
  • Certainly initializing a 4K array can't be that slow, since languages like Java do this all the time (they initialize all references). You must have had a very high-performance project. – Adam Gent Jun 23 '10 at 17:52
  • Yes, performance was important on that project. Add to that the fact that ~99.99% of the time the logging function just checked its parameters against some flags in shared memory, saw that logging was disabled, and returned. Imagine my expression when I discovered that initialization was in one of the top 5 places of a cachegrind profile. – ninjalj Jun 23 '10 at 18:23
26

Because in C, declaration and initialisation are deliberately different steps. They are deliberately different because that is how C was designed.

When you say this inside a function:

void demo(void)
{
    int *param;
    ...
}

You are saying, "my dear C compiler, when you create the stack frame for this function, please remember to reserve sizeof(int*) bytes for storing a pointer." The compiler does not ask what's going there - it assumes you're going to tell it soon. If you don't, maybe there's a better language for you ;)

Maybe it wouldn't be diabolically hard to generate some safe stack clearing code. But it'd have to be called on every function invocation, and I doubt that many C developers would appreciate the hit when they're just going to fill it themselves anyway. Incidentally, there's a lot you can do for performance if you're allowed to be flexible with the stack. For example, the compiler can make the optimisation where...

If your function1 calls another function2 and stores its return value, or maybe there are some parameters passed in to function2 that aren't changed inside function2... we don't have to create extra space, do we? Just use the same part of the stack for both! Note that this is in direct conflict with the concept of initialising the stack before every use.
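
A hedged illustration of that slot reuse (whether the two locals actually land in the same place depends on the compiler and optimisation level, so treat this as a sketch, not a guarantee):

#include <stdio.h>

int main(void)
{
    {
        int first = 1;
        printf("first  lives at %p\n", (void *) &first);
    }
    {
        int second = 2;   /* many compilers reuse first's stack slot here */
        printf("second lives at %p\n", (void *) &second);
    }
    return 0;
}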

But in a wider sense (and, to my mind, more importantly), it's aligned with C's philosophy of not doing very much more than is absolutely necessary. And this applies whether you're working on a PDP-11, a PIC32MX (what I use it for) or a Cray XT3. It's exactly why people might choose to use C instead of other languages.

  • If I want to write a program with no trace of malloc and free, I don't have to! No memory management is forced upon me!
  • If I want to bit-pack and type-pun a data union, I can! (As long as I read my implementation's notes on standard adherence, of course; see the sketch after this list.)
  • If I know exactly what I'm doing with my stack frame, the compiler doesn't have to do anything else for me!
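
As an aside on the type-punning item, here is a minimal sketch of what that can look like (the printed bit pattern assumes IEEE-754 single-precision floats):

#include <stdio.h>
#include <stdint.h>

union pun {
    float    f;
    uint32_t bits;
};

int main(void)
{
    union pun p;
    p.f = 1.0f;
    /* Reading a union member other than the one last written
       reinterprets the stored bytes. */
    printf("1.0f has bit pattern 0x%08lX\n", (unsigned long) p.bits);
    return 0;
}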

In short, when you ask the C compiler to jump, it doesn't ask how high. The resulting code probably won't even come back down again.

Since most people who choose to develop in C like it that way, it has enough inertia not to change. Your way might not be an inherently bad idea; it's just not really asked for by many other C developers.

detly
  • malloc and delete? no conforming program has any trace of that! ;) – jk. Jun 23 '10 at 14:35
  • I guess this is where my true annoyance with the language is. I prefer declarative, immutable functional programming over mutable procedural languages like C. That being said, I find it seriously ironic that C is the de facto language used for embedded programming. Today I think SAFETY should be far more important than performance (instead, buy faster chips). I mean, do you really want a memory leak in your elevator's braking system? – Adam Gent Jun 23 '10 at 14:36
  • @Adam Gent: Criticizing a language on the grounds that it isn't the sort of language you like is rather futile, isn't it? The reason C is heavily used for embedded programming is that it's efficient. If you're selling systems in the tens of millions, it's a lot cheaper to have engineers making the code safe than to have to spend an extra dollar each on a faster CPU or bigger program storage. – David Thornley Jun 23 '10 at 14:42
  • @Adam Gent - I want strict control over timing and memory in my high precision scientific instruments :) And it is safe, because I know at all times exactly what my code is doing - what I told it to do. Your definition of safety may vary. (Besides, find me a Haskell compiler for the PIC32MX series, or the dsPIC24.) – detly Jun 23 '10 at 14:43
  • You might want to change that to “`malloc` and `free`” there. – Donal Fellows Jun 23 '10 at 14:44
  • @detly touché... touché... your point is taken, and I didn't mean to offend. However, if that is true, why not program in assembly? It is 2010, right, and chips are getting pretty cheap? When can we stop programming in C, or do you believe C is that superior? Aren't we moving to a more concurrent architecture these days? It seems C is kind of weak in this area. – Adam Gent Jun 23 '10 at 14:45
  • Additionally, your frustration with the language is perfectly rational, but C has a context and a use like any other tool. The weakness of functional languages is that they are completely antithetical to maintaining, controlling and persisting **state**, which is exactly how many people design a control system. – detly Jun 23 '10 at 14:47
  • @jk - `malloc` and `free`! FREE! Arg! ... Can you tell I never use them? :P – detly Jun 23 '10 at 14:48
  • @Adam Gent - no offense, I don't want you to think you're being shouted down about it :) Anyway, assembly is a nightmare to read and write, and worse to maintain. Remember, it's *always* a practical consideration as to what language I use for the job - C is infinitely more readable, but still has a fairly close mapping to the same level of control (if desired). If a functional language were portable to my platform, I might like to try it, but I'd really need to rethink my design, and I'd *still* have to (re)learn all the low level details to make sure nothing blows up. – detly Jun 23 '10 at 14:55
  • In short... picking your language should not be done via [categorical imperative](http://en.wikipedia.org/wiki/Categorical_imperative) :) – detly Jun 23 '10 at 14:59
  • The upside of C is that it lets you write code that works closer to the way the CPU works than some other languages. Something like Ada on an embedded system tends to have a lot of undocumented features in the runtime support library to do all the hand-holding it does for you. It practically includes its own OS. That's not practical in all embedded systems. In the end, not knowing how something is done can be more dangerous than having to do it yourself on an embedded system. i.e. When in doubt, init your own pointers with NULL. – NoMoreZealots Jun 23 '10 at 20:12
  • @NoMoreZealots and @detly I will agree that C does map very well to how **stuff** really works, and I am glad I learned the language as it helped me comprehend CPU architecture when I was in school. – Adam Gent Jun 23 '10 at 23:46
  • @Adam Gent: The amusing thing is, with the exception of IO issues, which can easily be abstracted, concurrent programming is required for CPU bound applications. Applying C to areas where this is allegedly easier in higher level languages most times *removes the need for concurrent programming*, due to the extreme performance gains over implementations in those other languages. If this is still not enough, you'll be wanting to use C with your concurrent implementation anyway. Switch to C: 50x performance. Switch to concurrency: sublinear core count performance increase. – Matt Joiner Nov 19 '10 at 16:22
  • I agree and disagree with you. The issue with concurrency that higher-level languages solve, and that is more difficult with C (and I don't necessarily mean threads), is locks and transactions, not the concurrency itself. Many times I have bash-for-loop-backgrounded a bunch of Unix coreutil (C) programs and gotten excellent performance. But if I wanted to spread the processes across many servers because I'm running millions of them, you could do it in C, but it comes for free with something higher-level like Erlang or Hadoop (Java). – Adam Gent Nov 20 '10 at 20:09
  • @Matt Joiner Check out http://doubleclix.wordpress.com/2010/11/11/google-a-study-in-scalability-and-a-little-systems-horse-sense/ An important thing to know is that IO does matter for big-data problems... So I question the CPU performance gains you get from choosing C. Also, 50x? Maybe if you're using a scripting language. For .NET/Java/OCaml/Haskell it's 1.5x-3x. – Adam Gent Nov 20 '10 at 21:38
  • @Adam Gent: You make some good points. For IO bound problems, choice of language becomes pretty unimportant. I wonder what the memory usage difference is for those languages competing in the 1.5-3x CPU usage range. Also I find those benchmarks regarding algorithms can be biased: many languages self-optimize tight loops, and perform much better in constrained problems than in the general case. – Matt Joiner Nov 20 '10 at 23:04
14

It's for performance.

C was first developed around the time of the PDP-11, for which 60K was a common maximum amount of memory; many machines had a lot less. Unnecessary assignments would be particularly expensive in this kind of environment.

These days there are many, many embedded devices that use C for which 60K of memory would seem infinite; the PIC12F675 has 1K of memory.

David Sykes
  • I just don't get it though. It's getting a value from somewhere, right? Somewhere it's getting assigned. How could it be more costly for the runtime of C to point to NULL than to assign them some random value? – Adam Gent Jun 23 '10 at 13:17
  • Couldn't the compiler do some sort of optimization? Usually things that are invariant, like always pointing to NULL, are easier to optimize for. – Adam Gent Jun 23 '10 at 13:19
  • @Adam: The value was there before. It's just a reuse of a specific memory location. – tur1ng Jun 23 '10 at 13:19
  • There's no need to bring up embedded devices; just remember that C was designed for a computer which allowed 64K of code and 64K of data at the same time. Other time, other constraints, other decisions. – AProgrammer Jun 23 '10 at 13:19
  • The runtime doesn't assign anything; it just reuses what happens to be there. – AProgrammer Jun 23 '10 at 13:20
  • @tur1ng Now I remember the more specific reason. I wish future versions of C would change this and make it a compile-time option. I cannot believe that, with high-level languages like Haskell able to compile code that runs faster than C, C cannot have pointers that default to NULL. – Adam Gent Jun 23 '10 at 13:24
  • @Adam, BTW, the compilers I use are commonly able to warn about use of uninitialized variables. Increase your warning level and fix what is found. Another thing: warnings and compile-time options are out of scope of the standard; if you want them and don't have them, just lobby your compiler vendor. – AProgrammer Jun 23 '10 at 13:59
  • @Adam Gent: A given C implementation certainly could have pointers that defaulted to NULL, either normally or as a compile-time option. The Standard has no requirements for the values of certain uninitialized variables, so an implementation may do whatever it pleases in this case. BTW, are you claiming that compiled Haskell code is always faster than compiled C code, or only in your particular area of interest? – David Thornley Jun 23 '10 at 14:52
  • @Adam: The compiler doesn't init the pointer to a random value--it doesn't init it at all. (Technically, the value is not "random," but just difficult to predict.) Initializing a value to anything at all means the compiler has to generate (and the program has to execute) additional code, so it costs both space and time. C tries very hard not to waste space/time unless you tell it to. There is a bigger issue at play, though: if the value of an uninitialized variable is somehow relevant to the execution of your program, you're doing something wrong. :) – Casey Barker Jun 23 '10 at 17:03
  • @Casey Barker You hit the nail on the head. In Java and .NET land there are some people that use the keyword "final" on all their local variables to avoid this. I like how Scala makes a difference between "values" and "variables". – Adam Gent Jun 23 '10 at 17:46
  • @AProgrammer the PDP-11 on which C was developed had 24K bytes; the PDP-7 on which B was developed had 8K 18-bit words. See http://cm.bell-labs.com/cm/cs/who/dmr/chist.html – ninjalj Jun 23 '10 at 18:32
  • @ninjalj, I was speaking of the architectural limit of the PDP-11. Obviously not all systems were maxed out, and some models were even more limited than what the architecture would allow. My main point is that to understand C, you have to think about how things were at its time of rapid evolution (say, till the mid-80s). – AProgrammer Jun 24 '10 at 08:47
  • @David Thornley I'm not claiming Haskell is always faster than C. As a developer I never claim something "always" to be the case :) – Adam Gent Jun 25 '10 at 11:52
  • @CaseyBarker: If the semantics of a language dictate that declaring a variable will cause it to have a default value (as is the case with static-duration variables in C), code which reads a variable without having explicitly written it first may not be doing anything wrong. On some platforms, `static int foo=0;` and `static int foo;` may allocate `foo` differently; if the latter style of allocation is required, even if `foo` needs to be zero on startup, leaving it "uninitialized" may be the correct (and required) thing to do. – supercat Nov 25 '14 at 19:42
8

This is because when you declare a pointer, your C compiler will just reserve the necessary space to hold it. So when you run your program, this space may already have a value in it, probably left over from previous data allocated in that part of memory.

The C compiler could assign this pointer a value, but this would be a waste of time in most cases, since you are expected to assign a custom value yourself in some part of the code.

That is why good compilers give a warning when you do not initialize your variables; so I don't think that there are so many bugs because of this behavior. You just have to read the warnings.
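
For example (assuming GCC; exact flag names vary between compilers), those warnings are typically switched on like this:

gcc -Wall -Wextra -O1 -o demo demo.c

GCC's uninitialized-use analysis works best with optimization enabled, which is why -O1 is included here.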

slaphappy
  • I believe he meant "many bugs" historically; apparently older C compilers were not as friendly as their contemporaries. – Kenny Evitt Jun 23 '10 at 13:38
  • imho any attempt to access the value of an uninitialized variable is a bug. It doesn't matter whether it's zero or random; you shouldn't be trying to read it if you didn't explicitly set it to something. – eemz Jun 23 '10 at 13:57
  • @joefis but it's easier to find and understand the bug if it's NULL and not some random value. This is particularly useful if you are doing concurrent programming. – Adam Gent Jun 23 '10 at 14:29
  • @Adam Gent: hmm it depends on your perspective. Personally I would think that garbage was a more clear indicator that I forgot to set the variable, rather than NULL which I might have done on purpose... – eemz Jun 23 '10 at 16:10
  • @AdamGent it's the responsibility of the programmer to initialize the value if needed before use, and the compiler already has warnings about that. Also, setting it to NULL may not help you find the bug more easily. MSVC and many other compilers already fill uninitialized memory with [0xCC, 0xCD](http://stackoverflow.com/questions/370195/when-and-why-will-an-os-initialise-memory-to-0xcd-0xdd-etc-on-malloc-free-new) or some other values in debug mode, and those are much easier than 0 to recognize. You'll see references to 0xCCCCCCCC, some strange repeating characters, or lots of similar cases. – phuclv Jun 07 '14 at 03:21
7

Pointers are not special in this regard; other types of variables have exactly the same issue if you use them uninitialised:

int a;       /* uninitialized: value is indeterminate */
double b;    /* uninitialized: value is indeterminate */

printf("%d, %f\n", a, b);   /* undefined behavior, shown for illustration */

The reason is simple: requiring the runtime to set uninitialised values to a known value adds an overhead to each function call. The overhead might not be much with a single value, but consider if you have a large array of pointers:

int *a[20000];
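
If you do want all of those pointers to start out as NULL, C lets you opt in with an explicit initializer; a one-line sketch:

int *a[20000] = { NULL };   /* first element explicit; the rest are implicitly null */

Any elements not covered by the initializer are set to null pointers, as if they had static storage duration; the difference is that now you have asked for (and pay for) that initialization explicitly.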
caf
  • Yes, due to the procedural nature of C you frequently define variables before assigning them, so I can see how this becomes a performance problem. – Adam Gent Jun 23 '10 at 14:31
4

When you declare a (pointer) variable at the beginning of the function, the compiler will do one of two things: set aside a register to use as that variable, or allocate space on the stack for it. For most processors, allocating the memory for all local variables on the stack is done with one instruction; it figures out how much memory all the local vars will need, and pulls down (or pushes up, on some processors) the stack pointer by that much. Whatever is already in that memory at the time is not changed unless you explicitly change it.

The pointer is not "set" to a "random" value. Before allocation, the stack memory below the stack pointer (SP) contains whatever is there from earlier use:

         .
         .
 SP ---> 45
         ff
         04
         f9
         44
         23
         01
         40
         . 
         .
         .

After it allocates memory for a local pointer, the only thing that has changed is the stack pointer:

         .
         .
         45
         ff |
         04 | allocated memory for pointer.
         f9 |
 SP ---> 44 |
         23
         01
         40
         . 
         .
         .

This allows the compiler to allocate all local vars in one instruction that moves the stack pointer down the stack (and free them all in one instruction, by moving the stack pointer back up), but forces you to initialize them yourself, if you need to do that.

In C99, you can mix code and declarations, so you can postpone your declaration in the code until you are able to initialize it. This will allow you to avoid having to set it to NULL.
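
A minimal sketch of that C99 style (the malloc call is just a stand-in for whatever computes the pointer's first meaningful value):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* ... earlier code ... */

    int *p = malloc(sizeof *p);   /* declared exactly where it gets its value */
    if (p == NULL)
        return EXIT_FAILURE;

    *p = 42;
    printf("*p = %d\n", *p);
    free(p);
    return 0;
}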

Tim Schaeffer
3

First, forced initialization doesn't fix bugs. It masks them. Using a variable that doesn't have a valid value (and what that is varies by application) is a bug.

Second, you can often do your own initialization. Instead of `int *p;`, write `int *p = NULL;` or `int *p = 0;`. Use `calloc()` (which initializes memory to zero) rather than `malloc()` (which doesn't). (No, all-bits-zero doesn't necessarily mean NULL pointers or floating-point values of zero. Yes, it does on most modern implementations.)
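
A minimal sketch of doing your own initialization along those lines:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = NULL;                    /* explicit: p is a null pointer */
    int *m = malloc(10 * sizeof *m);  /* contents of the block are indeterminate */
    int *c = calloc(10, sizeof *c);   /* contents of the block are all-bits-zero */

    if (m == NULL || c == NULL) {
        free(m);
        free(c);
        return EXIT_FAILURE;
    }

    printf("c[0] = %d\n", c[0]);      /* guaranteed to print 0 */
    m[0] = 7;                         /* m[0] must be written before it is read */
    printf("m[0] = %d\n", m[0]);

    free(m);
    free(c);
    return 0;
}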

Third, the C (and C++) philosophy is to give you the means to do something fast. Suppose you have the choice of implementing, in the language, a safe way to do something and a fast way to do something. You can't make a safe way any faster by adding more code around it, but you can make a fast way safer by doing so. Moreover, you can sometimes make operations fast and safe by ensuring that the operation is going to be safe without additional checks - assuming, of course, that you have the fast option to begin with.

C was originally designed to write an operating system and associated code in, and some parts of operating systems have to be as fast as possible. This is possible in C, but less so in safer languages. Moreover, C was developed when the largest computers were less powerful than the telephone in my pocket (which I'm upgrading soon because it's feeling old and slow). Saving a few machine cycles in frequently used code could have visible results.

David Thornley
1

So, to sum up what ninjalj explained: if you change your example program slightly, your pointers will in fact initialize to NULL:

#include <stdio.h>

// Change the "storage" of the pointer variables from "stack" to "bss"
int * randomA;
int * randomB;

int main(void)
{
  int * nullA = NULL;
  int * nullB = NULL;

  printf("randomA: %p, randomB: %p, nullA: %p, nullB: %p\n\n",
     (void *) randomA, (void *) randomB, (void *) nullA, (void *) nullB);

  return 0;
}

On my machine this prints

randomA: 00000000, randomB: 00000000, nullA: 00000000, nullB: 00000000

S.C. Madsen
0

I think it comes from the following: there's no reason why memory should contain a specific value (0, NULL or whatever) when powered up. So, if not previously written, a memory location can contain whatever value, which from your point of view is random anyway (that very location could have been used before by some other software, and so contain a value that was meaningful for that application, e.g. a counter; but from "your" point of view, it is just a random number). To initialize it to a specific value, you need at least one more instruction; but there are situations where you don't need this initialization a priori, e.g. v = malloc(x) will assign to v a valid address or NULL, no matter the initial content of v. So, initializing it could be considered a waste of time, and a language (like C) can choose not to do it a priori.

Of course, nowadays this is mainly insignificant, and there are languages where uninitialized variables have default values (NULL for pointers, when supported; 0/0.0 for numeric types, and so on; lazy initialization makes it not so expensive to initialize an array of, say, 1 million elements, since the elements are initialized for real only if accessed before an assignment).
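
A minimal sketch of the malloc case mentioned above (the indeterminate initial value of v is simply never consulted):

#include <stdlib.h>

int main(void)
{
    int *v;            /* indeterminate, but never read in this state */
    v = malloc(100);   /* v is now a valid address or NULL, regardless of
                          whatever garbage it held before */
    free(v);           /* free(NULL) is a no-op, so this is safe either way */
    return 0;
}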

ShinTakezou
0

The idea that this has anything to do with random memory contents when a machine is powered up is bogus, except on embedded systems. Any machine with virtual memory and a multiprocess/multiuser operating system will initialize memory (usually to 0) before giving it to a process. Failure to do so would be a major security breach. The 'random' values in automatic-storage variables come from previous use of the stack by the same process. Similarly, the 'random' values in memory returned by malloc/new/etc. come from previous allocations (that were subsequently freed) in the same process.
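
A hedged demonstration of that stack reuse (reading an uninitialized variable is undefined behavior; this is shown only to make the "previous use of the stack" point visible, and it typically behaves as described only with optimizations disabled):

#include <stdio.h>

void leave_value(void)
{
    int x = 12345;   /* writes 12345 into a stack slot, then returns */
    (void) x;
}

void read_stale(void)
{
    int y;                  /* uninitialized: reading it is undefined behavior */
    printf("y = %d\n", y);  /* on many implementations at -O0, prints 12345 */
}

int main(void)
{
    leave_value();
    read_stale();
    return 0;
}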

R.. GitHub STOP HELPING ICE
  • He does not think that it has anything to do with random memory contents; he is just using the variable names randomX in his example to point out that an uninitialized pointer seems to get initialized to a "random" address. – b00n12 Jun 21 '15 at 08:29
-1

For it to point to NULL it would have to have NULL assigned to it (even if it was done automatically and transparently).

So, to answer your question, the reason a pointer can't be both unassigned and NULL is that a pointer cannot be both not assigned and assigned at the same time.

  • In other languages such as Java and C#, unassigned values still get some predictable value. Your argument is based purely on your semantics and not mine. – Adam Gent Jun 23 '10 at 23:40