I started by researching the question "is &((T*)NULL)->member UB in C?". It is an example from my textbook, which presented it as the old implementation of offsetof.

I know that offsetof can no longer be implemented in standard C++ (according to the cppreference page).
But after reading some C++ CWG issues, my question has more or less become "is dereferencing a null pointer UB?".
Also, I don't think implementations would have moved offsetof away from &((T*)NULL)->member in C without a reason, but I don't know what that reason is. Maybe because it's UB? I couldn't find any wording that says &((T*)NULL)->member is UB in C, though. For C++, I think it's UB if T is not a standard-layout type.
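For reference, here is a minimal sketch of the two spellings being compared. `Sample` is a hypothetical type made up for illustration, and the textbook form appears only as a comment because its status is exactly what is in question here. The portable spelling simply uses the `offsetof` macro from `<cstddef>`:

```cpp
#include <cstddef>      // offsetof, std::size_t
#include <type_traits>  // std::is_standard_layout_v

// Hypothetical standard-layout type, used only for illustration.
struct Sample {
    int    id;
    double value;
};

static_assert(std::is_standard_layout_v<Sample>,
              "offsetof is only unconditionally supported for standard-layout types");

// The textbook form, left as a comment on purpose:
//   #define OLD_OFFSETOF(T, member) ((std::size_t)&((T*)NULL)->member)

// The portable spelling: let the implementation's own macro do the work.
constexpr std::size_t id_offset    = offsetof(Sample, id);
constexpr std::size_t value_offset = offsetof(Sample, value);

static_assert(id_offset == 0, "the first member of a standard-layout struct is at offset 0");
static_assert(value_offset >= sizeof(int), "value is laid out after id");
```

The exact value of `value_offset` depends on the platform's alignment rules, so the sketch only checks what the layout rules guarantee.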

At the beginning, I thought there would be wording that explicitly says something like "dereferencing a null pointer is UB".
However, as I dug deeper, I found that it's more complicated than I thought.
After reading a lot of Stack Overflow answers, I found that they don't agree.
Some posts say it's well-defined, some say it's UB, and some say it's not specified.

The posts that say it's well-defined cite CWG issue #232 and CWG issue #315, like the answer to "c++ access static members using null pointer".
The posts that say it's not specified argue that the standard doesn't address it explicitly.
The posts that say it's UB argue that the resolutions of those issues have never been incorporated into the standard, so it's still UB. They also quote the wording "If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined.".

The example from that Stack Overflow question is:

#include <iostream>
class demo {
public:
  static void fun()
  {
    std::cout << "fun() is called\n";
  }
  static int a;
};

int demo::a = 9;

int main()
{
  demo *d = nullptr;
  d->fun();
  std::cout << d->a;
  return 0;
}

Their rough reasoning for saying it's well-defined was:

  1. E1->E2 is equivalent to (*(E1)).E2.
  2. Thus, if *d; is legal, then d->fun() is legal.
  3. CWG issue #232 says that p = 0; *p; is not inherently an error; an lvalue-to-rvalue conversion would give it undefined behavior.
  4. CWG issue #315 says that *d in the above example is not an error when d is null unless the lvalue is converted to an rvalue (7.3.2 [conv.lval]), which it isn't here.
  5. Thus *d; is legal, and therefore d->fun() is legal.
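Whichever reading turns out to be right, static members can be reached without any object expression at all, which sidesteps the question for code like the example. A minimal sketch, using a hypothetical stand-in for the `demo` class whose `fun()` returns its message instead of printing it so the result is easy to check:

```cpp
#include <string>

// Hypothetical stand-in for the `demo` class from the example.
class demo {
public:
    static std::string fun() { return "fun() is called"; }
    static int a;
};

int demo::a = 9;

// Qualified names reach static members directly; no pointer is involved,
// so there is nothing to dereference.
std::string call_fun() { return demo::fun(); }
int read_a() { return demo::a; }
```

This does not answer whether `d->fun()` with a null `d` is UB; it only shows the spelling that avoids depending on the answer.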

These issues were discussed around 2005, when C++03 was still the current standard.
However, in C++20, the standard explicitly requires E1 in E1->E2 to be a prvalue:

n4861(expr.ref#2): For the second option (arrow) the first expression shall be a prvalue having pointer type.

So I think there may be an lvalue-to-rvalue conversion here, since E1 shall be a prvalue?
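The value categories involved can be inspected without evaluating anything, because `decltype` operands are unevaluated. The sketch below assumes a hypothetical minimal `demo` and a namespace-scope pointer `d`. It only reports types and value categories and says nothing about whether evaluating `*d` is defined:

```cpp
#include <type_traits>

struct demo { static void fun() {} };  // hypothetical minimal stand-in
demo* d = nullptr;

// (d) is an lvalue naming the pointer object itself.
static_assert(std::is_same_v<decltype((d)), demo*&>);

// +d is a prvalue of pointer type: producing it means reading the value
// stored in d, i.e. an lvalue-to-rvalue conversion on d.
static_assert(std::is_same_v<decltype(+d), demo*>);

// *d is an lvalue of type demo. decltype does not evaluate its operand,
// so no indirection happens here.
static_assert(std::is_same_v<decltype(*d), demo&>);
```

So there are two different expressions a conversion could apply to: `d` (the pointer) and `*d` (the result of the indirection).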

Btw, the standard used to give "dereferencing the null pointer" as an example of undefined behavior:

n1146(intro.execution#4): Certain other operations are described in this International Standard as undefined (for example, the effect of dereferencing the null pointer).

But the example was changed by CWG issue #1102. The reason given was:

There are core issues surrounding the undefined behavior of dereferencing a null pointer. It appears the intent is that dereferencing is well defined, but using the result of the dereference will yield undefined behavior. This topic is too confused to be the reference example of undefined behavior, or should be stated more precisely if it is to be retained.

That issue was discussed in 2010, 13 years ago, so this has been a problem for a long time, but sadly I still can't find the answer.

All in all, can a language lawyer give me a conclusion on this? Is dereferencing a null pointer UB in C++20, for example in &((T*)NULL)->member and the d->fun() above? Or is it implementation-defined or unspecified behavior?

Hopefully the history and the relevant standard wording can be provided.

Edit:
My summary is that this is still an unresolved issue. For now, it is always UB by omission in expr.unary#op-1.sentence-3, which only defines the behavior if there is an object to which the pointer points. But that's probably not the intended specification.

Btw, there is a more recent discussion of this topic with the same outcome: https://github.com/cplusplus/CWG/issues/198

Please check the comments by @user17732522 and the answer by @Brian Bi.

Mes
  • 4
    Ask for one language only please, it is either C++ or C (they are NOT the same language). And it is and probably will forever be UB in C++. And `&((T*)NULL)->member` is something you would never write in C++ even without a null pointer. That would be at best `std::dynamic_cast(&ptr)` (and even using dynamic_cast is usually a sign of a design flaw). And can you explain why dereferencing a nullptr is a problem for you? – Pepijn Kramer Aug 13 '23 at 15:31
  • 3
    You are basically referencing everything there is to say about this. CWG 232 is still unresolved. – user17732522 Aug 13 '23 at 15:34
  • @PepijnKramer `dynamic_cast`, not `std::dynamic_cast`, and that's not a possible implementation of `offsetof`. – user17732522 Aug 13 '23 at 15:35
  • @user17732522 yup without the std. – Pepijn Kramer Aug 13 '23 at 15:36
  • To be a bit more language-lawyer-like, dereferencing a `nullptr` is ok. It is using the result that will be UB. – Pepijn Kramer Aug 13 '23 at 15:37
  • 2
    "_so I think there may be an lvalue-to-rvalue conversion here since E1 shall be prvalue?_": That's not relevant to whether or not there would be UB. The pointer value itself is completely fine as a null pointer. The question is whether a lvalue resulting from the indirection can be a "null lvalue" which currently isn't specified anywhere. So a strict answer would be that it is always UB by omission in https://eel.is/c++draft/expr.unary#op-1.sentence-3 which only defines the behavior if there is an object to which the pointer points. But that's probably not the intended specification. – user17732522 Aug 13 '23 at 15:39
  • @user17732522 thank you very much! So the reason it's UB now is because `d->fun()` is equivalent to `(*d)->fun()`, and `*d` violates the wording you gave? – Mes Aug 13 '23 at 15:50
  • @PepijnKramer thank you very much! But what do you mean by dereferencing a `nullptr` being ok? Is that different from dereferencing `NULL`? I know `nullptr` is different from `NULL`, but I don't know why dereferencing `nullptr` would not be a problem. – Mes Aug 13 '23 at 15:53
  • 1
    `nullptr` is the standard value for null pointers in C++. NULL is just a `#define 0` and thus not typesafe. – Pepijn Kramer Aug 13 '23 at 15:54
  • @Pepijn Kramer okay, but I still can't understand why dereferencing a `nullptr` is ok. Wouldn't it violate the wording in https://eel.is/c++draft/expr.unary#op-1.sentence-3? – Mes Aug 13 '23 at 16:00
  • BTW, remember that library/compiler implementations might do stuff which would be UB for regular code. – Jarod42 Aug 13 '23 at 16:02
  • 1
    @Mes "_it's UB now_": It has been defined like that since at least C++11. (Assuming you mean `(*d).fun()`.) – user17732522 Aug 13 '23 at 16:03
  • @PepijnKramer `nullptr` and `0` are both _null pointer constants_ that become _null pointer values_ when converted to a pointer type. From which the null pointer value originates doesn't matter. The only difference is that `0` has other non-pointer related behavior which is why `nullptr` was introduced to avoid mistakes. – user17732522 Aug 13 '23 at 16:05
  • @user17732522 As said, it is at least more typesafe because it can be distinguished from an integer value (and help in selecting the right overloads) – Pepijn Kramer Aug 13 '23 at 16:06
  • @Mes I made a mistake – Pepijn Kramer Aug 13 '23 at 16:07
  • @user17732522 oh okay! Thank you. Btw, though it's not relevant to the UB, is there an lvalue-to-rvalue conversion for `d->fun()` in C++20? It seems to expect an rvalue for the first expression. – Mes Aug 13 '23 at 16:18
  • Interesting question since `fun` is a class function, and `a` is a class variable. I suspect (**IANALL**) that `d->` is not relevant (not dereferenced), and is treated as-if it were a `D::`. – Eljay Aug 13 '23 at 16:37
  • 1
    Keep in mind: "undefined behavior" does not mean something bad must happen. It **only** means that the language definition doesn't tell you what that code does. Yes, the traditional implementation of `offsetof` introduces undefined behavior; that's okay, because the writer of the standard library knows what the compiler that the library ships with does with that code. If you have similar intimate knowledge of what your compiler does you can write code like that, too. But you don't have that knowledge. – Pete Becker Aug 13 '23 at 17:25
  • @Mes Yes, your quoted paragraph seems to clearly state that a lvalue-to-rvalue conversion must happen first on `d`. But that makes sense: In order to dereference the pointer you should need to read the value stored in `d`, which is exactly what the lvalue-to-rvalue conversion does. – user17732522 Aug 13 '23 at 18:43
  • @user17732522 okay, but CWG issue #315, which dates from C++98/03, says there is no lvalue-to-rvalue conversion here. I wonder why C++98 and C++20 give different results. Btw, the same paragraph in C++98 (n1146) was: the type of the first expression (the pointer expression) shall be “pointer to class object” – Mes Aug 13 '23 at 21:20
  • 1
    @Mes That's talking about a potential lvalue-to-rvalue conversion on the result of `*p`, not on `p`. What the issue resolution says might be the intent, but even if there is no lvalue-to-rvalue conversion, a strict reading of the current and previous standard iterations still makes it UB because `*`'s specification doesn't cover the case. Even in C++98 the definition of "lvalue" doesn't permit a null value: "_An lvalue refers to an object or function._" ([basic.lval]/2) I also checked now that even in C++98 `->` is defined in terms of `*`. – user17732522 Aug 13 '23 at 21:36
  • 1
    If you want a more recent discussion of this topic (with the same outcome) see e.g. https://github.com/cplusplus/CWG/issues/198. – user17732522 Aug 13 '23 at 21:40
  • `offsetof` is not implementable using standard C++ features. It needs magic support from the specific compiler in use. As a result, the defining code sequence on one platform may result in UB if used on another platform. Treat `offsetof` as if it were a keyword: no universal definition has standard guarantees to work properly. And yes, dereferencing `nullptr` is UB. In embedded environments, however, 0 may be a valid memory location, so they may choose to accept a `nullptr` dereference as a non-standard extension. – Red.Wave Aug 14 '23 at 09:38
  • @user17732522 sorry for the confusion. I know that the behavior is UB whether or not there is an lvalue-to-rvalue conversion. I just wonder why there seems to be a difference between C++98 and C++20. Based on CWG issue #315, I thought `(*d).fun()` is equivalent to `d->fun()`, so I asked "is there an lvalue-to-rvalue conversion in `d->fun()`?", since I thought that if there is a conversion in `d->fun()`, there would also be one on the `(*d)` in `(*d).fun()`. But based on your reply, that doesn't seem right? – Mes Aug 14 '23 at 12:21
  • 2
    @Mes In both C++98 and C++20, `d->fun()` and `(*d).fun()` are equivalent and in both there is an lvalue-to-rvalue conversion on `d`, but none on `*d`. The issue is speaking of a _hypothetical_ lvalue-to-rvalue conversion on `*d`, e.g. as in `*d + 1` if `d` was of type `int*`. – user17732522 Aug 14 '23 at 12:33
  • @user17732522 ahh I see, thank you very much, I appreciate your patience! – Mes Aug 14 '23 at 12:42
  • `&((T*)NULL)->member` does not dereference anything. It is just an offset computation. – user207421 Aug 14 '23 at 23:21
  • @user207421 why is it an offset computation? I can't find wording that specifies this in either C or C++. – Mes Aug 15 '23 at 08:24
  • 1
    @Mes Because it evaluates the address of the member relative to zero. No compiler would generate a dereference operation from that line of code. (Which is not to say that it is now necessarily legal code.) I guess maybe the x86 `LEA` instruction might be used, which could be problematic. – user207421 Aug 15 '23 at 23:22

1 Answer

As of now, the issue of whether dereferencing a null pointer is UB is still unresolved. And it is not clear whether the direction indicated in CWG 232, i.e. that it should be UB only if an attempt is made to access the value through the result of the dereference, is still the consensus of CWG (although there is at least one situation where it's explicitly legal, namely when the resulting lvalue is of polymorphic type and is the operand of typeid). And if CWG were to agree on a direction, then it is not clear whether EWG would accept that direction. So, really, no one knows the answer.

There is at least one good reason why &((T*)NULL)->member should be UB. An implementation presumably computes &E->m by adding a fixed offset to the value of E. If E is a null pointer, this arithmetic will generate an address value that may be recognized by the hardware as not being valid, resulting in a trap on some implementations on which loading an invalid pointer value into a register causes a trap. I would imagine that an eventual resolution of CWG 232, if one were to actually occur, would clarify that this situation is UB.
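To make the "fixed offset" model concrete, here is a small sketch that checks it for a pointer to a real object, where everything is well-defined. `T` is a hypothetical standard-layout type, and comparing addresses through `std::uintptr_t` assumes the usual flat mapping from pointers to integers, which the standard leaves implementation-defined:

```cpp
#include <cstddef>  // offsetof
#include <cstdint>  // std::uintptr_t

// Hypothetical standard-layout type, used only for illustration.
struct T {
    int    first;
    double m;
};

// For a pointer to an actual object, &p->m is expected to be the object's
// address plus offsetof(T, m) on typical flat-address implementations.
bool member_address_is_base_plus_offset(T* p) {
    const auto base   = reinterpret_cast<std::uintptr_t>(p);
    const auto member = reinterpret_cast<std::uintptr_t>(&p->m);
    return member - base == offsetof(T, m);
}
```

Calling it with the address of a live object, e.g. `T obj{}; member_address_is_base_plus_offset(&obj);`, is expected to return `true` on common platforms. The sketch deliberately never passes a null pointer, since that case is the unresolved one.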

Brian Bi
  • I was actually trying to follow the idea that `->` implies pointer arithmetic before. Something like `int *p2 = NULL; p2 += 1;`: the `p2 += 1` is equivalent to `p2 = &NULL[1];`, which is invalid since `NULL` is not an array object. But I can't find wording that specifies that `->` involves pointer arithmetic in either C or C++. – Mes Aug 15 '23 at 08:26
  • @Mes The standard does not say what kind of assembly instructions an implementation should use. That is not the job of the standard. – Brian Bi Aug 15 '23 at 15:03
  • Sorry for the confusion. I mean, I thought that if we say a behavior is UB, there would be one or more places in the standard that explicitly or implicitly specify that the behavior is UB. So, though I can understand "_if `E` is a null pointer, adding an offset to it is invalid_", if I am going to categorize this problem as UB, I think I need to find wording to prove it. Otherwise, I can only categorize it as the "_unresolved_" you mentioned, or as something weird. And sadly I have been looking for such wording for several days, but I still can't find it, except for what @user17732522 gave me above. – Mes Aug 15 '23 at 19:22
  • 1
    @Mes As I said in my answer, the pointer arithmetic thing is a reason why it *should* be UB and why I would expect it to be clarified as UB eventually. But as you say, there is currently no actual specification that it is UB. – Brian Bi Aug 15 '23 at 20:07
  • Ahh, no wonder I can't find it in the standard. Thanks for your clarification :) And I want to confirm: so there is no dereference in `&((T*)NULL)->member`? Just like the comment above says, it is just an offset computation? – Mes Aug 15 '23 at 20:21
  • 1
    @Mes If `m` is not in a virtual base class of `T`, then `&(p->m)` should normally compile to a simple add instruction, where `p` is a pointer to `T`. If `m` is in a virtual base class, then it may be necessary to read memory to calculate the offset of `m`. – Brian Bi Aug 15 '23 at 20:28
  • I want to ask an unrelated question: what about `&((T*)NULL)->member` in C? Is there no actual specification that it is UB there either? Maybe the situation in C is simpler since there are no static members in C? – Mes Aug 15 '23 at 21:26
  • 1
    @Mes The place to ask an unrelated question is not the comments section. – Brian Bi Aug 15 '23 at 21:59