0

This board seems to be filled with this type of question, and I did go through a few of them, but I am still confused about the difference between these two:

char string[] = "Hello"; 
char * string2 = "hello";

Now,

char string[] = "Hello"; 

... is an array that occupies 6 consecutive bytes of memory, where the characters are stored, including the \0 at the end.

%p of &string shows memory address 0x7fffbcabce90. %p of string shows the same memory address 0x7fffbcabce90. %p of &string[1] shows 0x7fffbcabce90 + sizeof(char), so 0x7fffbcabce91, etc.

char * string2 = "hello";

... is a pointer to char, and it points to the memory address of the first character of the string (h).

%p of &string2 shows memory address 0x7fffbcabce88. %p of string2 shows a different memory address, 0x400ca8. %p of &string2[1] shows 0x400ca8 + sizeof(char), so 0x400ca9.

My questions are: what is this memory address range (around 0x400000)? Is it the reason why I can't modify the characters of the string like this?

string[1] = 'c';   //that works
string2[1] = 'c';  //not working

Thanks!

Edit: typo (%string2 => &string2)

Edit2: As was explained to me, the keyword is string literal.

char * string2 = "hello";

... is a pointer to a string literal. Here's a thread where R Samuel Klatchko explains where literals are stored in memory:

String literals: Where do they go?

ChibiSlick
  • BTW I cannot possibly imagine that you haven't found a single duplicate from which you could understand this. I remember having explained this myself **several times.** –  Dec 08 '13 at 17:55
  • @H2CO3 Would you please be so kind as to give me an example where you explain that memory address range? Two comments below, you're telling me: "@ChibiSlick I don't know. Nor should you worry about that. – H2CO3 1 min ago." Are you confused? :) – ChibiSlick Dec 08 '13 at 18:02
  • No, I'm not confused, but this is not C anymore. You are asking about a system's implementation detail. And I'm bad at low-level stuff, so I don't know why the linker chose that address or why `0x400000` is read-only. –  Dec 08 '13 at 18:05

3 Answers

6
char string[] = "Hello"; 

declares a null-terminated array of char and initialises it to contain "Hello\0".

char * string2 = "hello";

declares a pointer to char and initialises it to point to a string literal.

Modifying a string literal invokes undefined behaviour. It's common for compilers to put string literals in read only memory, and it would seem that is what your compiler is doing. Not all compilers will do that – the key point is that modifying string literals invokes undefined behaviour.

There's little to be gained from looking at the actual values of the pointers in your program. The read-only address is close to 0x00400000 because, I presume, that is the default load address of an executable module on your system; it is the default on Windows, and non-position-independent x86-64 Linux executables commonly load there too. But nothing says that a module must load there. A library won't.


Let's look at:

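printf("%p %p %p %p\n", (void *)string, (void *)&string, (void *)string2, (void *)&string2);

In this expression, string decays to a pointer to its first element, and so gives the same output as &string. And string2 is a pointer whose value is the address of the literal, while &string2 is the address of the pointer variable itself.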

David Heffernan
  • you should also explain the address range - as the question mentions it specifically. – elyashiv Dec 08 '13 at 17:44
  • @elyashiv How would I explain that? It's really up to the compiler and the OS to decide where things are stored. What significance can you ascribe to the actual pointer values? – David Heffernan Dec 08 '13 at 17:45
  • @DavidHeffernan I think he is referring to the fact that the numerical value of `&array[0]` and `&array` are the same, whereas those of `&pointer[0]` and `&pointer` differ. –  Dec 08 '13 at 17:48
  • @H2CO3 The question doesn't even say that. I guess we have to assume what the asker really meant. – David Heffernan Dec 08 '13 at 17:53
  • I understand the part where one is an array and the other is a pointer. I also get that the memory addresses are different; of course they are, one is an array, the other is a pointer :) My questions are about the memory range being used (0x400000). What is this range? Why is it using this range? Is it read-only? Is it the reason I can't change the character values directly? Can I explicitly use a different range? – ChibiSlick Dec 08 '13 at 17:53
  • @DavidHeffernan ***But yes it does.*** Quote from the question: "%p of `&string` shows memory address `0x7fffbcabce90`. %p of `string` shows **the same memory address** `0x7fffbcabce90`" –  Dec 08 '13 at 17:54
  • @H2CO3 Actually it says: "%p of %string2 shows memory address 0x7fffbcabce88" whatever that means – David Heffernan Dec 08 '13 at 17:55
  • @ChibiSlick You are worrying about implementation but you are missing that point. You cannot modify string literals. End of story. It doesn't matter where they are stored. – David Heffernan Dec 08 '13 at 17:56
  • @DavidHeffernan Hey, can't you read my quote? It **does say that the address of an array appears to be the same as the address of its first element.** The part you are referring to is about the **pointer, not the array.** –  Dec 08 '13 at 17:56
  • @H2CO3 I can read. What is `%string2`? Is that some syntax that I'm not familiar with? – David Heffernan Dec 08 '13 at 17:59
  • @DavidHeffernan No. That's a typo. (Sorry but you were completely missing my point. I was not trying to assume anything.) –  Dec 08 '13 at 18:00
  • @DavidHeffernan Not impossible. Anyway, do you see now what I was trying to say? –  Dec 08 '13 at 18:04
  • @David Heffernan Thanks David. I guess the keyword here is "literal", as in defining a string literal versus an array of char. They would be stored in different places in memory. – ChibiSlick Dec 08 '13 at 18:15
  • @H2CO3 Yeah, I can see it. Arrays decay to pointers. And so on. Not in a position to edit answer just now. Will do soon. – David Heffernan Dec 08 '13 at 18:15
  • @Chibi It's a pure compiler detail where they are stored. But I can tell you are on Windows. The literal is in the module. And hence 0x00400000 base address. – David Heffernan Dec 08 '13 at 18:23
  • @David Heffernan I'm actually on CentOS running in a VM. I've just found an interesting thread that answers the question of where literals are stored. Updating original post. – ChibiSlick Dec 08 '13 at 18:27
  • @ChibiSlick That post says nothing more than I already said. As for the values of the address, they really are implementation details. We have to guess because we don't know what OS you are using. But really, why do you care about the actual address values? That's really not relevant from a C language perspective. – David Heffernan Dec 08 '13 at 18:34
2

The layout of the virtual address space of a process is generally arranged by the linker and the loader. Generally, the memory of a process can be organized into types including:

  • The “text” section, which contains the instructions that are executed.
  • Read-only data, which contains values that are read but are not modified or executed.
  • Initialized data, which is set to initial values at program start-up but may be modified during program execution.
  • Uninitialized data, which the program needs for work space but does not need to be set to any particular values at program start-up.

There are various embellishments and refinements of the types of memory, but the above suffice for a general orientation.

Separating memory of different types is important for performance and for security, including:

  • Because the text and read-only data sections are not modified during program execution, they can be shared if the same program is executed more than once simultaneously. Each time the program is run, the operating system can map these portions of the virtual address spaces of different processes to the same physical memory. Every process will have the same data in these sections.
  • Because the non-text sections should never be executed as instructions, they can be marked as non-executable. This means the hardware will cause an exception if an attempt is made to execute them, which should only occur if there is a bug or an attacker causes the program to execute things it should not.

The operating system does not keep track of every individual byte in the address space of a process. Most hardware does not support it, and it would require too much data. Instead, there is a minimum amount of memory, called a page, that is used. Whenever memory is marked executable or not-executable, writable or not writable, it must be done in units of whole pages. 4096 bytes is a typical page size, but it varies from system to system.

What you are seeing in the different addresses of char string[] = "Hello"; and char *string2 = "Hello"; is that the read-only data of the string literal "Hello" is being put into a different page than the modifiable array initialized with "Hello".

There is nothing magic about the address 0x400000, except that it was chosen as the place for read-only data in the system you are using. It could be elsewhere. There may even be linker options to move it to an address of your choosing. What is important is merely that it is separate from the modifiable data.

While the linker is reading object modules and organizing them to form one executable, it concatenates the same types of segments from different object modules. That is, it takes the text segments from each module and puts them together into one large text segment. It takes the read-only data segments from each module and puts them together into one large read-only data segment. And so on. This is simply more efficient than leaving each object module’s segments as separate pieces: if one object module used 2.5 pages for read-only data and another used 1.5 pages, then putting them together uses just 4 pages, whereas leaving them separate with fragments of pages unused would use 5 pages. (Some special segments might be processed in different ways than concatenation.)

Additionally, although memory can be managed in units of pages, there may be benefits to grouping larger amounts of memory of the same type together. If you mark an entire megabyte to have the same attributes, the operating system might use less data to keep track of it than if you marked pages individually. This could be part of the reason that the linker sets aside a large amount of program address space for a certain type of memory. (I am speculating; I am not familiar with the current motivations and design for the specific operating system you are using.)

Eric Postpischil
  • That would be an answer of epic proportions, if such things would exist. In the meantime, I'll mark your answer as the accepted one :) Thanks Eric! – ChibiSlick Dec 08 '13 at 19:30
  • *What is important is merely that it is separate from the modifiable data.* It doesn't need to be. It could be right next to the modifiable memory. It could be in the same page. – David Heffernan Dec 08 '13 at 19:56
  • @DavidHeffernan: I am not sure what you mean. It is important that read-only data be in separate pages from modifiable data so that it can be marked with different attributes. Read-only data and modifiable data cannot be in the same page; the page can only be read-only or modifiable, not both. Perhaps you mean the C standard permits string literals to be in modifiable memory, but that is not what this answer is about. – Eric Postpischil Dec 08 '13 at 20:11
  • I'm not really sure what the answer is about then. Because the question is about C rather than some specific implementation. I think the asker is not really grasping the difference between what the standard says, and specific implementations. There are plenty of C implementations for which writing to a string literal does not cause program failure. Even the concept of pages is implementation specific and again there are machines for which the concept has no meaning. – David Heffernan Dec 08 '13 at 20:30
  • @DavidHeffernan: The question is tagged C, but it is clearly about memory layout. The OP wanted to know why a string literal was at such a different address than a non-literal array object. I answered them. – Eric Postpischil Dec 08 '13 at 20:47
  • Two different objects will be stored at different addresses. There's not much more to it than that. – David Heffernan Dec 08 '13 at 20:55
  • @DavidHeffernan: Yes, there is more to it than that. The question is not about what the C standard says. The question is about how operating systems (or linkers and loaders at least) lay out memory. The fact that the OP discovered it working in C is largely irrelevant (and you could edit the tags if you think that suitable). This is a question about how computers work. Expecting to limit an answer to the C standard is inappropriate. **This answer gives the information the OP wanted.** – Eric Postpischil Dec 08 '13 at 21:10
  • Yes, you've given the answer that the asker wanted. – David Heffernan Dec 08 '13 at 21:19
  • Please, don't fight over this. It's silly. Obviously, I did not understand clearly that memory allocation of string literals (among other things) were done at a lower level. I guess that's the process of learning. I've changed the tags to reflect the context of my question. Thanks to both of you for your help. – ChibiSlick Dec 08 '13 at 21:36
  • @Chibi What do you mean "lower level"? – David Heffernan Dec 09 '13 at 07:20
0

Because an array is not a pointer.

char string[] = "Hello"; 

declares string to be an array and initializes it with the characters of the string on the RHS. This array is not declared as const, so you can modify its elements.

Of course, the array begins in memory where its first element lies (well, at least on any sensible implementation I know of, but this is not a requirement), so naturally string (which decays into a pointer to the first element) has the same numerical pointer value as &string, which is a pointer to the array itself. They have different types, though (pointer-to-char and pointer-to-array-of-6-chars).

This is what it looks like in memory:

+-----+-----+-----+-----+-----+-----+
| 'H' | 'e' | 'l' | 'l' | 'o' | \0  |
+-----+-----+-----+-----+-----+-----+
^
+-- pointer to first element
|
+-- also: pointer to array

On the other hand,

char *string2 = "Hello";

is legal C but risky: the pointer (which the array "Hello" decays into) points to characters that must not be modified, yet it is stored in a pointer-to-non-const. It is safer to write

const char *string2 = "Hello";

Now string2 is a pointer, and it has storage separate from that of the value it points to (the first element of the array). So naturally its address is different from that of the first element of the array. And it is not initialized with a copy of the array, only with the array's address. The array (the string literal) is (presumably) placed in read-only memory; the Standard says it's undefined behavior to attempt to modify its contents.

This is how the construct with the pointer is laid out in memory:

+-----+            +-----+-----+-----+-----+-----+-----+
|  p  | ---------> | 'H' | 'e' | 'l' | 'l' | 'o' | \0  |
+-----+            +-----+-----+-----+-----+-----+-----+
^                  ^
+- address of      +- address of the array (and the first element)
   the pointer        This is the **value** that is stored in p
   itself
   (completely
   unrelated to
   the address
   of the array)
  • I thought I took enough time to show that I knew the difference between the array and the pointer, I guess not :) You mentioned the address range 0x400000 being read-only. My confusion is about that. What is this range? Why is it using this range? Is it read-only? Is it the reason I can't change the character values directly? Can I explicitly use a different range? – ChibiSlick Dec 08 '13 at 17:56
  • @ChibiSlick I don't know. Nor should you worry about that. –  Dec 08 '13 at 17:57