22

Inspired by this question.

We can initialize a char pointer with a string literal:

char *p = "ab";

And it is perfectly fine. One could think that it is equivalent to the following:

char *p = {'a', 'b', '\0'};

But apparently that is not the case. And not only because string literals are stored in read-only memory: it appears that even though the string literal has a char array type, and the initializer {...} looks like a char array as well, the two declarations are handled differently, as the compiler gives this warning:

warning: excess elements in scalar initializer

in the second case. What explains this behavior?

Update:

Moreover, in the latter case the pointer p ends up with the value 0x61 (the value of the first array element, 'a') instead of a memory location, so the compiler, as warned, takes just the first element of the initializer and assigns it to p.

Eugene Sh.
  • ha... didn't know it does that. I could have sworn they were identical. – bolov May 29 '15 at 15:34
  • @bolov Yep. This is why I like reading questions on SO. It sometimes proves that considering yourself a pro in something is a *slight* overestimate... by showing very simple things. – Eugene Sh. May 29 '15 at 15:36
  • I think the string initialization syntax works because, ultimately, it's replaced by its location in memory and resolves to an address; on the other hand, the compiler sees a char *p as a place to store a single value, and the explicit array initialization implies more than one value to be stored. – David W May 29 '15 at 15:40
  • @DavidW this is basically how I am explaining it to myself, that the initialization expressions are working in a different way than runtime expressions. – Eugene Sh. May 29 '15 at 16:01

4 Answers

9

I think you're confused because char *p = "ab"; and char p[] = "ab"; have similar syntax, but different semantics.

I believe that the latter case (char p[] = "ab";) is best regarded as a short-hand notation for char p[] = {'a', 'b', '\0'}; (initializes an array with the size determined by the initializer). Actually, in this case, you could say "ab" is not really used as a string literal.

However, the former case (char *p = "ab";) is different in that it simply initializes the pointer p to point to the first element of the read-only string literal "ab".

I hope you see the difference. While char p[] = "ab"; is representable as an initialization such as you described, char *p = "ab"; is not, as pointers are, well, not arrays, and initializing them with an array initializer does something entirely different (namely give them the value of the first element, 0x61 in your case).

Long story short, C compilers only "replace" a string literal with a char array initializer if it is suitable to do so, i.e. it is being used to initialize a char array.
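To make the array case concrete, here is a minimal sketch (the function names are made up for illustration, they are not from the question). It assumes nothing beyond standard C: an array initialized from "ab" should be a char[3] with the same bytes as the brace-list form, while sizeof applied to a pointer measures only the pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: size of an array initialized from a string literal.
   The array type is char[3]: 'a', 'b' and the terminating '\0'. */
size_t array_from_literal_size(void)
{
    char p[] = "ab";
    return sizeof p;
}

/* Hypothetical helper: returns 1 if the string-literal form and the
   brace-list form of the initializer produce identical arrays. */
int array_forms_match(void)
{
    char a[] = "ab";
    char b[] = {'a', 'b', '\0'};
    return sizeof a == sizeof b && memcmp(a, b, sizeof a) == 0;
}

/* Hypothetical helper: with a pointer, sizeof measures the pointer
   object itself, not the literal it points to. */
size_t pointer_size(void)
{
    const char *p = "ab";   /* p holds the address of the literal's first char */
    return sizeof p;
}
```

Calling these from main and printing the results is enough to see the difference; only the two array forms are interchangeable.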

  • The OP understands this .. He basically wants to know why initializer list is not interpreted as a literal.. – Gopi May 29 '15 at 16:14
  • @Gopi Almost, it's more like why the literal is not interpreted as a list. – Eugene Sh. May 29 '15 at 16:15
  • @EugeneSh.: I am sorry. See update. I hope what I meant to say is clearer now –  May 29 '15 at 16:16
  • @EugeneSh. Strings in C, as we know, have multiple variants, and I think it is convention to represent literals within double quotes. (The standard supports my theory) :) – Gopi May 29 '15 at 16:20
  • So, as I can see from all of the answers, everything is converging to the syntax. String literals seem to be treated differently depending on the context. – Eugene Sh. May 29 '15 at 16:24
  • @EugeneSh.: yes, that's exactly the case (and what I was trying to convey through my answer...) –  May 29 '15 at 16:24
  • I think I will accept this as an answer, though the others were helpful as well. – Eugene Sh. May 29 '15 at 16:30
  • @Mints97 when you say `C compilers only "replace" a string literal with a` char `array initializer if it is suitable to do so, i.e. it is being used to initialize a char array`, you are of course referring to the case of `char p[] = "ab";` correct? The fact that the string lit. `"ab"` is being *replaced* by essentially something like `char p[] = { 'a', 'b', '\0' };` right? – RastaJedi Apr 22 '16 at 02:11
  • After reading the answer by @Gopi it seems as though it's still called a string lit. even when `"ab"` is used to initialize `char p[]`. I wonder if there was ever memory created for the string literal in the `char p[] = "ab";` case vs. the `char *p = "ab";` case, by that I mean I wonder about whether there is a difference between those two string literals, like since one remains accessible (but unmodifiable) and the other seems only temporary... if there was memory for it at one point, does it remain? – RastaJedi Apr 22 '16 at 02:19
8

String literals have a "magical" status in C. They're unlike anything else. To understand why, it's useful to think about this in terms of memory management. For example, ask yourself, "Where is a string literal stored in memory? When is it freed from memory?" and things will start making sense.

They're unlike numeric literals which translate easily to machine instructions. For a simplified example, something like this:

int x = 123;

... might translate to something like this at the machine level:

mov ecx, 123

When we do something like:

const char* str = "hello";

... we now have a dilemma:

mov ecx, ???

The hardware doesn't necessarily have a native understanding of what a multi-byte, variable-length string actually is. It mainly knows about bits and bytes and numbers and has registers designed to store these things, yet a string is a memory block containing multiple of those.

So compilers have to put that string's memory block somewhere, and they typically emit it into a globally accessible place in the binary (typically a read-only memory segment or the data segment). They might also coalesce multiple identical string literals into the same memory region to avoid redundancy. The compiler can then generate a mov/load instruction that loads the address of the string literal, and you work with it indirectly through a pointer.
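A small sketch of the "load an address" idea, with made-up function names. It relies only on the fact that a given string literal in the source denotes one array with static storage, so evaluating it twice should yield the same address. Whether two separate but identical literals share storage is left to the compiler, so the sketch deliberately does not check that.

```c
/* Hypothetical helper: evaluating the literal yields the address of
   its first character, which is what ends up in the register. */
const char *get_literal(void)
{
    return "hello";
}

/* Hypothetical helper: returns 1 if two calls see the same object.
   The literal is one static array, so the address should not change. */
int same_object_each_call(void)
{
    return get_literal() == get_literal();
}
```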

Another scenario we might run into is this:

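#include <stdio.h>

static const char* some_global_ptr = "blah";

int main(void)
{
    if (1) /* stands in for any condition */
    {
        const char* ptr = "hello";
        /* ... */
        some_global_ptr = ptr;
    }
    /* ptr is gone here, but the literal it pointed to is still alive */
    printf("%s\n", some_global_ptr);
}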

Naturally ptr goes out of scope, but we need that literal string's memory to linger around for this program to have well-defined behavior. So literal strings translate not only to addresses to globally-accessible memory blocks, but they also don't get freed as long as your binary/program is loaded/running so that you don't have to worry about their memory management. [Edit: excluding potential optimizations: for the C programmer, we never have to worry about the memory management of a literal string, so the effect is like it's always there].
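The same point as a function-sized sketch. pick_greeting is a hypothetical name and the strings are arbitrary. The local pointer disappears when the block ends, but the literal it pointed to has static storage duration, so returning its address is fine.

```c
/* Hypothetical helper: returns a pointer to one of two string literals. */
const char *pick_greeting(int formal)
{
    if (formal) {
        const char *ptr = "good day";   /* ptr is local to this block */
        return ptr;                      /* the literal outlives ptr */
    }
    return "hello";
}
```

Doing the same with a local char array instead of a literal would return a dangling pointer, which is exactly the difference being described.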

Now about character arrays: a string literal does have an array type (sizeof "hello" is 6, counting the terminator), but in nearly every other expression it immediately converts to a pointer to its first element. So we cannot bind the literal itself to a named array; we can only point to its memory through char*/const char*.

This code actually gives us a handle to such an array without involving a pointer:

char str[] = "hello";

Something interesting happens here. A production compiler is likely going to apply all kinds of optimizations, but excluding those, at a basic level such code might create two separate memory blocks.

The first block is going to be persistent for the duration of the program, and will contain that literal string, "hello". The second block will be for that actual str array, and it's not necessarily persistent. If we wrote such code inside a function, it's going to allocate memory on the stack, copy that literal string to the stack, and then free the memory from the stack when str goes out of scope. To put it another way, the address of str is not going to match that of the literal string.
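A minimal sketch of the "two blocks" picture, using a made-up function name. It assumes only that the array is its own object: writing to it should leave a literal obtained separately untouched, and the two addresses should differ.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the array behaves as a private,
   writable copy that is distinct from the string literal. */
int array_is_private_copy(void)
{
    const char *lit = "hello";   /* points at a string literal */
    char str[] = "hello";        /* separate array holding the same characters */

    str[0] = 'j';                /* fine: str is writable */

    return strcmp(str, "jello") == 0
        && strcmp(lit, "hello") == 0
        && (const char *)str != lit;
}
```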

Finally, when we write something like this:

char str[] = {'h', 'e', 'l', 'l', 'o', '\0'};

... it's not necessarily equivalent, as here there are no literal strings involved. Of course an optimizer is allowed to do all kinds of things, but in this scenario, it is possible that we will simply create a single memory block (allocated on the stack and freed from the stack if we're inside a function) with instructions to move all these numbers (characters) you specified to the stack.

So while we're effectively achieving the same effect as the previous version as far as the logic of the software is concerned, we're actually doing something subtly different when we don't specify a literal string. Again, optimizers can recognize when doing something different can have the same logical effect, so they might get fancy here and make these two effectively the same thing in terms of machine instructions. But short of that, this is subtly different code we're writing.

Last but not least, a multi-element initializer like {...} is meant for an aggregate (an array or a struct) with memory that is allocated and freed at some point when things go out of scope. A scalar such as a single pointer accepts at most one initializer expression, so that's why you get the "excess elements" warning when trying to use such a list with a pointer.
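To illustrate the scalar side: as far as I know, standard C does allow a scalar initializer to sit inside braces, as long as the braces hold a single expression. The sketch below (hypothetical function name) uses that form. Some compilers may print a style warning about braces around a scalar initializer, but the pointer should still end up pointing at the literal.

```c
/* Hypothetical helper: a scalar initialized with a braced, single-element list. */
const char *braced_scalar(void)
{
    const char *p = {"ab"};   /* one expression in braces: fine for a scalar */
    return p;
}
```

Adding a second element to that list is what triggers the "excess elements in scalar initializer" diagnostic from the question.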

  • `The first block is going to be persistent for the duration of the program, and will contain that literal string, "hello"`... any compiler that would be doing that without a `char *otherStr = "hello"` anywhere else in the code is going to be wasting precious memory. –  May 29 '15 at 16:40
  • Yeah, to try to simplify I didn't go into too many details of how an optimizer would handle this. I mainly just wanted to explain the difference from a basic level. Perhaps I should add a few more caveats. –  May 29 '15 at 16:43
  • How can we know whether two blocks of memory is allocated in case of char str[] = "hello";? One for string literal and the other for a[]. How to check this out after compilation? – Jon Wheelock Oct 18 '15 at 01:16
  • @JonWheelock Can check the resulting assembly. Another way but won't necessarily guarantee a thorough result is to output the address of both (`str` and `&a`). However, doing that could have a remote chance of interfering with the optimization -- best way is to always check the assembly code. –  Oct 18 '15 at 12:57
7

The second example is not valid C, although it parses. In C, a brace-enclosed list such as {'a', 'b', '\0'} can be used to initialize an array, but not a pointer.

Instead, you can use a C99 compound literal (also available in some compilers as an extension, e.g., GCC) like this:

char *p = (char []){'a', 'b', '\0'};

Note that this is more flexible than a string literal, as the array isn't necessarily null-terminated.
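A short sketch of both properties, with made-up function names. It assumes a C99 or later compiler. Note that a compound literal inside a function only lives until the enclosing block ends, so the pointer is used only inside the function.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the compound literal reads like "ab"
   and can be modified through the pointer. */
int compound_literal_demo(void)
{
    char *p = (char[]){'a', 'b', '\0'};
    int matches = strcmp(p, "ab") == 0;

    p[0] = 'x';   /* allowed: unlike a string literal, this array is writable */

    return matches && strcmp(p, "xb") == 0;
}

/* Hypothetical helper: the array need not be null-terminated. */
int unterminated_demo(void)
{
    char *q = (char[]){'a', 'b'};   /* exactly two chars, no '\0' */
    return q[0] == 'a' && q[1] == 'b';
}
```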

Yu Hao
6

From C99 we have

A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes

So in the second definition there is no string literal, as nothing there is enclosed in double quotes. A pointer must be given memory to point to before anything is written through it; or, if you want to use an initializer list, then

char p[] = {'a','b','\0'};

is what you want. Basically both are different declarations.
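For the pointer route, one possible sketch (make_ab is a hypothetical name, and malloc is just one way to obtain the memory): allocate room for the characters first, then write them through the pointer.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a heap-allocated "ab", or NULL on failure.
   The caller is responsible for calling free() on the result. */
char *make_ab(void)
{
    char *p = malloc(3);   /* room for 'a', 'b' and '\0' */
    if (p == NULL)
        return NULL;

    p[0] = 'a';
    p[1] = 'b';
    p[2] = '\0';
    return p;
}
```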

Spikatrix
Gopi
  • Yes, both parts are clear. I guess the question reduces to "why is a string literal not treated the same way as a char array (even if the latter is RO)?" There is no such *type* in C as "string literal" as far as I know. – Eugene Sh. May 29 '15 at 15:46
  • @EugeneSh. Because arrays and string literals are two different things? – SwiftMango May 29 '15 at 16:00
  • @EugeneSh.: I guess it's to present a way to easily use immutable strings? –  May 29 '15 at 16:00
  • @EugeneSh. There has to be some convention to separate literals from initializer list. As you know that literals are read-only but initializer list needs to be r/w . Again what you are assigning to also matters as array and pointers are different things. – Gopi May 29 '15 at 16:02
  • @texasbruce Apparently they ARE different. But where is this difference reflected? – Eugene Sh. May 29 '15 at 16:03