47

From The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie:

& operator only applies to objects in memory: variables and array elements. It cannot be applied to expressions, constants or register variables.

Where are expressions and constants stored if not in memory? What does that quote mean?

E.g:
&(2 + 3)

Why can't we take its address? Where is it stored?
Will the answer be the same for C++, since C is its parent?

This linked question explains that such expressions are rvalues, and that rvalues do not have addresses.

My question is where are these expressions stored such that their addresses can't be retrieved?

Peter Cordes
Aquarius_Girl
  • 10
    Expressions aren't stored anywhere, that's why you can't get their address. Same with numeric literal constants. – Some programmer dude Dec 19 '17 at 09:56
  • 1
    the value of the expression is directly stored in processor's registers so it does not have a memory address which can be addressed by the & operator – rob_ Dec 19 '17 at 09:56
  • 10
    Possible duplicate of [Why is taking the address of a temporary illegal?](https://stackoverflow.com/questions/4301179/why-is-taking-the-address-of-a-temporary-illegal) or [taking the address of a temporary object](https://stackoverflow.com/questions/4301179/why-is-taking-the-address-of-a-temporary-illegal) or [Where are temporary object stored?](https://stackoverflow.com/questions/9109831/where-are-temporary-object-stored) or etc. However you slice it, this has been asked and answered many times. And understanding how high-level code is assembled into machine code would answer this. – underscore_d Dec 19 '17 at 09:58
  • In a sense you are correct. If you look at Ruslan's answer you see the constants included in the executable code. If you look at the binary produced by the compiler/assembler a literal zero really is stored in memory. It's not considered 'accessible' because it's not in a 'variable'. Kudos to you for trying to understand at such a deep level. – Jay Dec 19 '17 at 16:32
  • 1
    They *are* stored in memory, just not addressable memory. CPUs have working registers, flags, and even instructions with constants implied. – Lee Daniel Crocker Dec 19 '17 at 18:59
  • 3
    C++ muddies the waters a bit, in that the result of an expression *can* be passed to a function taking a constant reference `const T&` or an rvalue reference `T&&`, and then that function can take the address of that argument if it wants to. Even if `T` is a simple built-in type like `int`! – Daniel Schepler Dec 19 '17 at 19:52
  • @Someprogrammerdude: I think that's oversimplifying it a bit. Of course expressions, or their target machine image, are stored in memory somewhere, otherwise, the computer couldn't execute it. Likewise, there is no strict requirement for all numeric literals to be inlined (e.g. in form of `fldpi` or the no-longer-an-optimization `xor eax, eax` to load zero). Likewise, AFAIK, C++ does not require string literals or non-trivial constant objects to _have_ an address. – Sebastian Mach Dec 20 '17 at 10:43
  • @SebastianMach A string literal is required to be stored as an array of constant (in C++) characters, an array where one could get a pointer to its first element as well as a pointer to the array itself. And object, constant or not, trivial or not, must be *defined* and that definition means memory may be allocated for that object, memory which can be pointed to. And if the object is ODR-used (like using the address-of operator on the object) then it most certainly can't be inlined or optimized away, the object *must* exist in data memory. – Some programmer dude Dec 20 '17 at 10:55
  • @Someprogrammerdude: It was more of a philosophical stretch. If there is no one to see the forest, does it even exist? Compare https://godbolt.org/g/rVxJKP -> Omitting the second string literal does not change the observable behavior. I also did not find any standard requirement that string literals need the f*** be stored if not used. The "if [...] ODR-used" in your comment is key. – Sebastian Mach Dec 20 '17 at 13:36

5 Answers

63

Consider the following function:

unsigned sum_evens (unsigned number) {
  number &= ~1; // ~1 = 0xfffffffe (32-bit CPU)
  unsigned result = 0;
  while (number) {
    result += number;
    number -= 2;
  }
  return result;
}

Now, let's play the compiler game and try to compile this by hand. I'm going to assume you're using x86 because that's what most desktop computers use. (x86 is the instruction set for Intel compatible CPUs.)

Let's go through a simple (unoptimized) version of what this routine could look like when compiled:

sum_evens:
  and edi, 0xfffffffe ;edi is where the first argument goes
  xor eax, eax ;set register eax to 0
  cmp edi, 0 ;compare number to 0
  jz .done ;if edi = 0, jump to .done
.loop:
  add eax, edi ;eax = eax + edi
  sub edi, 2 ;edi = edi - 2
  jnz .loop ;if edi != 0, go back to .loop
.done:
  ret ;return (value in eax is returned to caller)

Now, as you can see, the constants in the code (0, 2, 1) actually show up as part of the CPU instructions! In fact, 1 doesn't show up at all; the compiler (in this case, just me) already calculates ~1 and uses the result in the code.
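
As a minimal sketch of the same point from the C side (not part of the original answer): the routine is repeated so the snippet compiles on its own, and a C11 `_Static_assert` shows that `~1u` is a value the compiler already knows before the program runs. On a platform where `unsigned` is 32 bits wide, that value is 0xfffffffe. The check is written against `UINT_MAX` so that it does not assume a particular width.

```c
#include <assert.h>
#include <limits.h>

/* Same routine as above, repeated so this sketch is self-contained. */
unsigned sum_evens(unsigned number) {
  number &= ~1;             /* clear the lowest bit: round down to an even number */
  unsigned result = 0;
  while (number) {          /* same as: while (number != 0) */
    result += number;
    number -= 2;
  }
  return result;
}

/* ~1u has every bit set except the lowest one. Both sides are integer
   constant expressions, so this is checked entirely at compile time. */
_Static_assert(~1u == UINT_MAX - 1u, "~1u is computed by the compiler");
```

Calling `sum_evens(6)` or `sum_evens(7)` should give 12 in both cases (6 + 4 + 2), since the mask drops the lowest bit first.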

While you can take the address of a CPU instruction, it often makes no sense to take the address of a part of it (in x86 you sometimes can, but in many other CPUs you simply cannot do this at all). Code addresses are also fundamentally different from data addresses, which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address). In some CPU architectures, code addresses and data addresses are completely incompatible (although this is not the case on x86 in the way most modern OSes use it).

Do notice that while (number) is equivalent to while (number != 0). That 0 doesn't show up in the compiled code at all! It's implied by the jnz instruction (jump if not zero). This is another reason why you cannot take the address of that 0 — it doesn't have one, it's literally nowhere.

I hope this makes it clearer for you.

aaaaaa123456789
  • Thanks for the effort! Please explain this: `the constants in the code (0, 2, 1) actually show up as part of the CPU instructions!` Which lines are you pointing to? I don't know how to read assembly code. – Aquarius_Girl Dec 19 '17 at 10:13
  • 4
    (@aaa: Drop a reference to GodBolt somewhere: https://godbolt.org/) (and you need `int result`) – Jongware Dec 19 '17 at 10:14
  • 1
    @Aquarius_Girl `sub edi, 2` contains the 2, for instance. `sub` is the instruction (technically called the opcode mnemonic), which means to subtract the second operand from the first and store the result in the first; `edi` is the first operand (a CPU register) and `2` is the second operand. So this does the equivalent of edi = edi - 2 — the 2 is encoded as part of the instruction. It could also have been compiled to `dec edi` twice — the `dec` instruction subtracts one from its operand, so doing it twice subtracts 2. In that case, the 2 wouldn't have shown up at all. – aaaaaa123456789 Dec 19 '17 at 10:18
  • 2
    The same could be said for many named variables too - they may well not exist in main memory either. That is, not until the compiler is forced to put them there because you take their address. So it's unclear why the same wouldn't apply to expressions. – Oliver Charlesworth Dec 19 '17 at 10:18
  • 3
    @OliverCharlesworth that applies clearly in this example, as `result` doesn't have an address either. However, if the compiler was forced to give it one, the semantics of how to handle the variable are clear. Constants are more awkward in that regard. But you _can_ do `((unsigned []) {3})` if you need a pointer to a constant 3, for instance — that's well defined by the language. (At least in C; not sure about C++ here.) – aaaaaa123456789 Dec 19 '17 at 10:20
  • 8
    I guess what I mean is that this answer more or less has it backwards - you're looking at the way the compiler *does* behave given the *actual* semantics of the language, and then stating that as the rationale for those semantics. But the compiler will (and could) generate whatever machine code was necessary to implement the semantics of the language - there's no intrinsic/physical reason why one *couldn't* take the address of expressions, etc. – Oliver Charlesworth Dec 19 '17 at 11:33
  • 6
    However, riffing off your previous comment - an answer that explained how it would be difficult to define sensible semantics (particularly in terms of lifetime) for taking the address of an expression, would be a better answer (IMO ;) – Oliver Charlesworth Dec 19 '17 at 11:37
  • 2
    If you use `test edi,edi` instead of `cmp edi,0`, you'll make the point of non-occurrence of constant `0` in the assembly code even more valid. – Ruslan Dec 19 '17 at 13:39
  • @Ruslan very true; I forgot about that instruction. – aaaaaa123456789 Dec 19 '17 at 19:11
  • I think this answer *does* actually help explain *why the standard* is this way, but it could be improved by making the connection more explicit: C was *initially* implemented this way because of how hardware CPUs worked, and because it got you better performance on those old machines if your compiler could trivially translate the language into the machine code. The *extra logic* needed to virtually "take the address" of something that was easier to compile as a constant value into the machine code would've been unwanted overhead for C's goals. The standard just made this official. – mtraceur Dec 19 '17 at 20:57
  • `on some CPUs like x86, code addresses are fundamentally different from data addresses (which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address)).` Is this at all platform/OS dependent? It's been years now but on Win32/x86 I've done a `memcpy` from a function pointer and was able to successfully call the target code at runtime, or start a thread against it. – briantist Dec 20 '17 at 02:11
  • 2
    @briantist that was part of a suggested edit, and it was partially wrong for x86. By the manual, it's true; code and data live in different segments, each with their own addressing space. But the way most modern OSes use the CPU is by making all segments point to the same region of memory, thus making data pointers (from the data segment) and code pointers (from the code segment) point to the same part of memory when they are numerically equal. Both Windows and Linux do this, as far as I know. I edited the post to make this clearer. – aaaaaa123456789 Dec 20 '17 at 09:18
  • @briantist Yes it's platform dependent in the same way the bitwise representation of signed integers is platform dependent: not on any common modern hardware, but certainly historically. Also for the record my suggested edit was *only* to add "on some CPUs like x86" as the statement "code addresses are fundamentally different from data addresses" without a platform-dependency disclaimer is just objectively wrong - for example on every MIPS and ARM chip I know of, where all addresses are completely the same, since those architectures have no concept of "data" vs "code" segments at all. – mtraceur Dec 21 '17 at 10:26
  • @briantist And that's sorta the whole point of the wording as I had suggested it in my edit. Code addresses and data addresses are certainly "fundamentally different" in the *meaning* we give them. But since we're talking about why the C language/standard does not permit certain behavior as a consequence of machine-level implementation realities: it's to allow CPUs with fundamentally different physical representations of data and function pointers - of which *older* x86 CPUs are the most well-known examples, even though most modern CPUs represent them the same way in the hardware. – mtraceur Dec 21 '17 at 10:40
  • @mtraceur To be fair, older x86 (16-bit era) was a bit of a mess with pointers, and most C compilers had some form of explicit support for near and far pointers, regardless of whether they were data or function pointers. Of course, near and far pointers were very different (2 vs 4 bytes, for starters)... That being said, for near pointers in 16-bit x86, you'd be right 100%. – aaaaaa123456789 Dec 23 '17 at 23:56
42

where are these expressions stored such that their addresses can't be retrieved?

Your question is not well-formed.

  • Conceptually

    It's like asking why people can discuss ownership of nouns but not verbs. Nouns refer to things that may (potentially) be owned, and verbs refer to actions that are performed. You can't own an action or perform a thing.

  • In terms of language specification

    Expressions are not stored in the first place, they are evaluated. They may be evaluated by the compiler, at compile time, or they may be evaluated by the processor, at run time.

  • In terms of language implementation

    Consider the statement

    int a = 0;
    

    This does two things: first, it declares an integer variable a. This is defined to be something whose address you can take. It's up to the compiler to do whatever makes sense on a given platform, to allow you to take the address of a.

    Secondly, it sets that variable's value to zero. This does not mean an integer with value zero exists somewhere in your compiled program. It might commonly be implemented as

    xor eax,eax
    

    which is to say, XOR (exclusive-or) the eax register with itself. This always results in zero, whatever was there before. However, there is no fixed object of value 0 in the compiled code to match the integer literal 0 you wrote in the source.

As an aside, when I say that a above is something whose address you can take - it's worth pointing out that it may not really have an address unless you take it. For example, the eax register used in that example doesn't have an address. If the compiler can prove the program is still correct, a can live its whole life in that register and never exist in main memory. Conversely, if you use the expression &a somewhere, the compiler will take care to create some addressable space to store a's value in.
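
A small sketch of that aside, with hypothetical function names chosen only for illustration. Whether a variable really ends up in a register is up to the compiler and cannot be observed from portable C, so the comments describe what the compiler is permitted to do, not what any particular compiler does.

```c
#include <assert.h>

/* The address of a is never taken, so the compiler is free to keep it
   in a register for its whole lifetime. */
int no_address_taken(void) {
  int a = 0;
  a += 5;
  return a;
}

/* Using &a obliges the compiler to behave as if a occupies addressable storage. */
int address_taken(void) {
  int a = 0;
  int *p = &a;   /* fine: a is an object, so & applies */
  *p = 5;        /* a write through the pointer is visible in a */
  return a;
}

/* With the register storage class, C itself rejects taking the address. */
int register_variable(void) {
  register int r = 5;
  /* int *q = &r; */  /* would not compile in C: address of a register variable */
  return r;
}
```

All three functions return 5; the difference lies only in what the compiler has to guarantee about where the value lives.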


Note for comparison that I can easily choose a different language where I can take the address of an expression.

It'll probably be interpreted, because compilation usually discards these structures once the machine-executable output replaces them. For example Python has runtime introspection and code objects.

Or I can start from LISP and extend it to provide some kind of addressof operation on S-expressions.

The key thing they both have in common is that they are not C, which as a matter of design and definition does not provide those mechanisms.

Useless
  • 3
    This plays somewhat loose with terminology - in the eyes of the language standard, "object" != "variable". A temporary is also an object. – Oliver Charlesworth Dec 19 '17 at 11:35
  • 11
    Other than that, I think this is the best answer here, so far - it's the only one that says (more or less) "because the standard says so", rather than looking at implementation details as if they were proof of an intrinsic/physical limitation. – Oliver Charlesworth Dec 19 '17 at 11:42
10

Such expressions end up part of the machine code. An expression 2 + 3 likely gets translated to the machine code instruction "load 5 into register A". CPU registers don't have addresses.
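
To illustrate, here is a sketch (the names `FIVE`, `table`, `table_length` and `value_through_pointer` are made up for this example). C treats `2 + 3` as an integer constant expression, which is why it is accepted in places that require a value known at compile time. To obtain an address, the value first has to be stored in an object.

```c
#include <assert.h>

/* 2 + 3 is an integer constant expression, so it is allowed where the
   language demands a value known at compile time. */
enum { FIVE = 2 + 3 };
int table[2 + 3];                       /* a file-scope array size must be constant */
_Static_assert(2 + 3 == 5, "evaluated before the program runs");

/* Number of elements in table, derived from its type alone. */
int table_length(void) {
  return (int)(sizeof table / sizeof table[0]);
}

/* To get an address, the value has to be stored in an object first. */
int value_through_pointer(void) {
  int sum = 2 + 3;   /* sum is an object; 2 + 3 is only its initial value */
  int *p = &sum;     /* &(2 + 3) would not compile, &sum does */
  return *p;
}
```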

Lundin
  • 3
    theoretically, If they end up getting translated in machine code, then they should take space in .text section. But they don't! Why is it so? – Gaurav Pathak Dec 19 '17 at 10:02
  • 2
    The same could be said for named variables, so I'm not sure this is a good explanation. – Oliver Charlesworth Dec 19 '17 at 10:08
  • 4
    @Gaurav: what "they"? In Lundin's example, the number `5` might appear as a literal operand as part of a larger machine code instruction. Parts of an instruction don't have addresses as well. (If you're going to nit-pick on semantics. They *do* but you cannot access it.) (Nitpick #2: some architectures may not store the actual number 5 as a byte on its own.) (Nitpick #3: depending on the circumstances, the number `5` may not appear *at all* in the instruction itself. Consider `a = 5*b;` which may be compiled to `lea eax,[ebx+4*ebx]`.) – Jongware Dec 19 '17 at 10:09
  • @usr2564301 "they" --> expression ;-) BTW, nice explanation. Thanks. – Gaurav Pathak Dec 19 '17 at 10:10
  • 2
    I wasn't nitpicking. It's just out of curiosity and lack of knowledge you can say. – Gaurav Pathak Dec 19 '17 at 10:14
  • @Gaurav: no you were not – but if I did not add those proviso's, someone else would have surely pointed it out to me. – Jongware Dec 19 '17 at 10:16
  • 1
    @Gaurav They do take up space in .text, but not as a sole literal, but rather as the instruction op code "load register A" followed by the number 5. If you were to read this address in .text, you would get the whole of that instruction. – Lundin Dec 19 '17 at 10:31
5

It does not really make sense to take the address of an expression. The closest thing you can do is a function pointer. Expressions are not stored in the same sense as variables and objects.

Expressions are stored in the actual machine code. Of course you could find the address where the expression is evaluated, but it just doesn't make sense to do it.

Read a bit about assembly. Expressions are stored in the text segment, while variables are stored in other segments, such as data or stack.

https://en.wikipedia.org/wiki/Data_segment

Another way to explain it is that expressions are cpu instructions, while variables are pure data.

One more thing to consider: The compiler often optimizes away things. Consider this code:

int x=0;
while(x<10)
    x+=1;

This code will probably be optimized to:

int x=10;

So what would the address of (x+=1) mean in this case? It is not even present in the machine code, so it has - by definition - no address at all.
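
As a sketch, the two versions can be wrapped in functions (the names are hypothetical) to show why the compiler is allowed to make this substitution: the result is identical either way. Whether a given compiler actually performs the rewrite depends on the compiler and its optimization settings.

```c
#include <assert.h>

/* The loop exactly as written above. */
int count_with_loop(void) {
  int x = 0;
  while (x < 10)
    x += 1;
  return x;
}

/* What an optimizer may reduce it to: no loop and no x += 1 anywhere. */
int count_folded(void) {
  int x = 10;
  return x;
}
```

Both functions return 10, so a caller cannot tell which one it was given.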

klutt
  • You said: "Expressions are not stored in the same sense as variables and objects.". That's my question. Where are they stored? If they are not stored anywhere then how does compiler/linker know where they are? – Aquarius_Girl Dec 19 '17 at 09:59
  • @Aquarius_Girl Fixed – klutt Dec 19 '17 at 10:06
  • @Aquarius_Girl: a compiler/linker does not need to know where individual expressions are stored. – Jongware Dec 19 '17 at 10:10
  • @usr2564301 who needs to know then? Someone ought to know. – Aquarius_Girl Dec 19 '17 at 10:11
  • 1
    @Aquarius_Girl: that 'someone' is the CPU, when running the program. The compiler compiles, the linker fills in some important addresses such as where each function *starts*. From that point on, its job is done. The semantics of C do not allow pointing "into" functions (probably because it's useless). – Jongware Dec 19 '17 at 10:19
  • "Expressions are stored in the text segment" - **constants** might be, for sure. – Oliver Charlesworth Dec 19 '17 at 11:38
  • 1
    @Aquarius_Girl Let's treat "address" as bricks and mortar - an actual house on a street. If you give someone your address, the contents of that house could be you, or your parents, or your dog. But an expression is the *process* of telling the computer "go second left, third right, and it's the fifth house on the left". When you've evaluated the process, you end up at a house and you can look inside. But the *process* of "go second left etc." does not have a house. As for constants, "second" clearly doesn't live anywhere either. – Graham Dec 19 '17 at 12:51
4

Where are expressions and constants stored if not in memory

In some (actually many) cases, a constant expression is not stored at all. In particular, think about optimizing compilers, and see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”

In your particular case of some C code having 2 + 3, most optimizing compilers would have constant folded that into 5, and that 5 constant might be just inside some machine code instruction (as some bitfield) of your code segment and not even have a well defined memory location. If that constant 5 was a loop limit, some compilers could have done loop unrolling, and that constant won't appear anymore in the binary code.

See also this answer, etc...

Be aware that C11 is a specification written in English. Read its n1570 draft. Read also the much bigger specification of C++11 (or later).

Taking the address of a constant is forbidden by the semantics of C (and of C++).
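
When a pointer to a constant value is really needed, C offers ways to create an object that holds the value. The following is a minimal sketch with hypothetical names. The compound literal form is C99 and later; it is not part of standard C++, although some compilers accept it as an extension.

```c
#include <assert.h>

static const int five = 5;   /* a named object, so &five is allowed */

int read_named_constant(void) {
  const int *p = &five;
  return *p;
}

int read_compound_literal(void) {
  /* A compound literal creates an unnamed object holding 5. It is an
     lvalue, so & applies. Inside a function it lives until the
     enclosing block ends. */
  int *p = &(int){5};
  *p += 1;                   /* it is an ordinary, modifiable object */
  return *p;
}
```

Here `read_named_constant()` returns 5 and `read_compound_literal()` returns 6. In both cases the address belongs to an object initialized from the constant, not to the constant itself.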

Basile Starynkevitch