330

The dot (.) operator is used to access a member of a struct, while the arrow operator (->) in C is used to access a member of a struct which is referenced by the pointer in question.

The pointer itself does not have any members that could be accessed with the dot operator (it is really just a number describing a location in virtual memory). So there would be no ambiguity if we simply defined the dot operator to automatically dereference its operand when that operand is a pointer (information that, as far as I know, is available to the compiler at compile time).

So why have the language creators decided to make things more complicated by adding this seemingly unnecessary operator? What is the big design decision?

Yu Hao
  • 119,891
  • 44
  • 235
  • 294
Askaga
  • 6,061
  • 5
  • 25
  • 49
  • 1
    Related: http://stackoverflow.com/questions/221346/what-can-i-use-instead-of-the-arrow-operator - also, you can override -> – Krease Nov 13 '12 at 18:00
  • 19
    @Chris That one's about C++ which of course makes a big difference. But since we're talking about *why* C was designed this way, let's pretend we're back in the 1970s - before C++ existed. – Mysticial Nov 13 '12 at 18:02
  • 8
    My best guess is, that the arrow operator exists to visually express "watch it! you're dealing with a pointer here" – Chris Nov 13 '12 at 18:04
  • 5
    At a glance, I feel that this question is very strange. Not all things are thoughtfully designed. If you keep this style in your whole life, your world would be full of questions. The answer which has got most votes is really informative and clear. But it does not hit the key point of your question. Follow the style of your question, I can ask too many questions. For example, the keyword ‘int’ is the abbreviation of ‘integer’; why does not the keyword ‘double’ also be shorter? – junwanghe Dec 11 '12 at 06:54
  • 3
    @junwanghe This question actually represents valid concern - why does the `.` operator has higher precedence than `*` operator? If it didn't, we could have *ptr.member and var.member. – milleniumbug Dec 18 '12 at 20:30
  • 5
    The . and -> operators represent completely distinct operations. The former indicates an offset that is known at compile time. The latter dereferences pointer at runtime and then applies the offset. Dereferencing a pointer can trigger undefined behavior (and lead to segfault, etc.). Expressing both with . will hide the difference and make the code harder to read and more prone to errors. – martinkunev Sep 11 '14 at 15:26
  • 1
    @junwanghe not everything is defined, but even less is random and meaningless in a human artifact like a programming language. Why 'int' but not 'dou'? Great question! If there is no explanation, then it's just (yet another) proof of capricious/bad language design. It's interesting to compare to other, more consistent and self-coherent languages. – hmijail Sep 04 '20 at 05:53
  • 2
    Try the Rust programming language. – Tianyi Shi Oct 06 '20 at 08:24

4 Answers

419

I'll interpret your question as two questions: 1) why -> even exists, and 2) why . does not automatically dereference the pointer. Answers to both questions have historical roots.

Why does -> even exist?

In one of the very first versions of the C language (which I will refer to as CRM, for "C Reference Manual", which came with 6th Edition Unix in May 1975), the -> operator had a very exclusive meaning, not synonymous with the * and . combination.

The C language described by CRM was very different from the modern C in many respects. In CRM struct members implemented the global concept of byte offset, which could be added to any address value with no type restrictions. I.e. all names of all struct members had independent global meaning (and, therefore, had to be unique). For example you could declare

struct S {
  int a;
  int b;
};

and the name a would stand for offset 0, while the name b would stand for offset 2 (assuming an int type of size 2 and no padding). The language required that all members of all structs in the translation unit either have unique names or stand for the same offset value. E.g. in the same translation unit you could additionally declare

struct X {
  int a;
  int x;
};

and that would be OK, since the name a would consistently stand for offset 0. But this additional declaration

struct Y {
  int b;
  int a;
};

would be formally invalid, since it attempted to "redefine" a as offset 2 and b as offset 0.

And this is where the -> operator comes in. Since every struct member name had its own self-sufficient global meaning, the language supported expressions like these

int i = 5;
i->b = 42;  /* Write 42 into `int` at address 7 */
100->a = 0; /* Write 0 into `int` at address 100 */

The first assignment was interpreted by the compiler as "take address 5, add offset 2 to it and assign 42 to the int value at the resultant address". I.e. the above would assign 42 to int value at address 7. Note that this use of -> did not care about the type of the expression on the left-hand side. The left hand side was interpreted as an rvalue numerical address (be it a pointer or an integer).

This sort of trickery was not possible with * and . combination. You could not do

(*i).b = 42;

since *i is already an invalid expression. The * operator, since it is separate from ., imposes more strict type requirements on its operand. To provide a capability to work around this limitation CRM introduced the -> operator, which is independent from the type of the left-hand operand.

As Keith noted in the comments, this difference between -> and the * and . combination is what CRM refers to as "relaxation of the requirement" in 7.1.8: "Except for the relaxation of the requirement that E1 be of pointer type, the expression E1->MOS is exactly equivalent to (*E1).MOS".

Later, in K&R C many features originally described in CRM were significantly reworked. The idea of "struct member as global offset identifier" was completely removed. And the functionality of -> operator became fully identical to the functionality of * and . combination.

Why can't . dereference the pointer automatically?

Again, in the CRM version of the language the left operand of the . operator was required to be an lvalue. That was the only requirement imposed on that operand (and that's what made it different from ->, as explained above). Note that CRM did not require the left operand of . to have a struct type. It just required it to be an lvalue, any lvalue. This means that in the CRM version of C you could write code like this

struct S { int a, b; };
struct T { float x, y, z; };

struct T c;
c.b = 55;

In this case the compiler would write 55 into an int value positioned at byte offset 2 in the contiguous memory block known as c, even though type struct T had no field named b. The compiler would not care about the actual type of c at all. All it cared about was that c was an lvalue: some sort of writable memory block.

Now note that if you did this

struct S *s;
...
s.b = 42;

the code would be considered valid (since s is also an lvalue) and the compiler would simply attempt to write data into the pointer s itself, at byte-offset 2. Needless to say, things like this could easily result in memory overrun, but the language did not concern itself with such matters.

I.e. in that version of the language your proposed idea of overloading the . operator for pointer types would not work: the . operator already had a very specific meaning when used with pointers (with lvalue pointers or with any lvalues at all). It was very weird functionality, no doubt. But it was there at the time.

Of course, this weird functionality is not a very strong reason against introducing an overloaded . operator for pointers (as you suggested) in the reworked version of C - K&R C. But it hasn't been done. Maybe at that time there was some legacy code written in the CRM version of C that had to be supported.

(The URL for the 1975 C Reference Manual may not be stable. Another copy, possibly with some subtle differences, is here.)

Praneeth
  • 902
  • 2
  • 11
  • 25
AnT stands with Russia
  • 312,472
  • 42
  • 525
  • 765
  • 1
    Why didn't it have `*i` be an lvalue of some default type (int?) at address 5? Then (*i).b would have worked the same way. – Random832 Nov 13 '12 at 21:41
  • 5
    @Leo: Well, some people fancy C language as higher-level assembler. At that period in C history the language actually was a higher-level assembler indeed. – AnT stands with Russia Nov 15 '12 at 19:26
  • 'since it attempted to "redefine" a as offset 2 and b as offset 0' -- why offset 2, rather than 1? – bradley.ayers Nov 24 '12 at 22:11
  • 1
    @bradley.ayers: I'm referring to the byte-offset of the data field from the beginning of the struct. If size of `int` is 2 bytes, then sequential members of type `int` will have offsets 0, 2, 4, 6 and so on. – AnT stands with Russia Nov 24 '12 at 23:00
  • You might want to consider flagging this out of wiki before it gets too old. Normally auto-wiki is to prevent abusive bumping, but that isn't the case here. – Mysticial Nov 29 '12 at 17:16
  • 1
    You have written `Later, in K&R C many features originally described in CRM were significantly reworked. The idea of "struct member as global offset identifier" was completely removed.` But in _The Development of the C Language_, Dennis M. Ritchie said "`While it foreshadowed the newer approach to structures, only after it was published did the language support assigning them, passing them to and from functions, and associating the names of members firmly with the structure or union containing them.`". Can you give me a more detailed explanation? – junwanghe Dec 11 '12 at 08:21
  • 1
    @AndreyT Based on the paper _The Development of the C Language_, as if that in K&R, struct member is still a global offset identifier. By the way, do you have the e-book of K&R? If so, could you give me a copy? My email is junwanghe@gmail.com. Thank you very much. – junwanghe Dec 11 '12 at 08:30
  • 39
    Huh. So this explains why many structures in UNIX (e.g., `struct stat`) prefix their fields (e.g., `st_mode`). – icktoofay Jan 19 '13 at 02:20
  • 5
    @perfectionm1ng: It looks like bell-labs.com has been taken over by Alcatel-Lucent and the original pages are gone. I updated the link to another site, although I can't say how long that one will stay up. Anyway, googling for "ritchie c reference manual" usually finds the document. – AnT stands with Russia Oct 09 '13 at 16:37
  • A mad dash down memory lane. I'm trying to reanimate some 40 year old C code where the authors used structs in this manner to implement a kind of union. – Lobachevsky Jun 16 '15 at 14:23
  • When was the meaning change? The current answer makes it sound like the change was made by the first K&R printing. K&R 1st Ed. is 1978, K&R 2nd Ed. is 1988. However, my copy of Expert C Programming (1994) says `*p.f` will "take the f offset from p" (p.45). That sounds like the old behavior. Why would a book from 1994, six years after ANSI and sixteen years after K&R 1st ed. list the change in behavior as a gotcha? Were compilers that slow to update? – Lorem Ipsum Jan 25 '23 at 17:27
53

Beyond the historical reasons (good and already reported), there is also a small issue with operator precedence: the dot operator has higher precedence than the star operator, so if you have a struct containing a pointer to a struct containing a pointer to a struct... These two are equivalent:

(*(*(*a).b).c).d

a->b->c->d

But the second is clearly more readable. The arrow operator has the highest precedence (just like dot) and associates left to right. I think this is clearer than using the dot operator for both structs and pointers to structs, because we know the type from the expression without having to look at the declaration, which could even be in another file.

effeffe
  • 2,821
  • 3
  • 21
  • 44
  • 3
    With nested data types containing both structs and pointers to structs this can make things more difficult, as you have to think about choosing the right operator for each submember access. You might end up with a.b->c->d or a->b.c->d (I had this problem when using the freetype library - I needed to look up its source code all the time). Also this doesn't explain why it wouldn't be possible to let the compiler dereference the pointer automatically when dealing with pointers. – Askaga Nov 13 '12 at 18:38
  • @BillAskaga: well I don't think that's harder to understand than all that parentheses, but maybe it's just a matter of taste. Anyway, there's not necessary a reason, almost everything in the language could be made in another way, I just tried to say why the operator is useful. Not everything is strictly necessary, we could even live without switch or for, but they are useful. – effeffe Nov 13 '12 at 18:45
  • 4
    While the facts you are stating are correct, they do not answer my original question in any way. You explain the equality of the a-> and *(a). notations (which has already been explained multiple times in other questions) as well as giving a vague statement about language design being somewhat arbitrary. I didn't find your answer very helpful, therefore the downvote. – Askaga Nov 28 '12 at 20:13
  • @BillAskaga: my point is not the equality of the two different forms, but a possible advantage of the arrow operator, and your question was about why this was added to the language. But if you're looking for a proven historical reason, yes, my answer can't provide this. Thanks for coming back to explain your decision anyway. – effeffe Nov 28 '12 at 20:35
  • 21
    @effeffe, the OP is saying that the language could have easily interpreted `a.b.c.d` as `(*(*(*a).b).c).d`, rendering the `->` operator useless. So the OP's version (`a.b.c.d`) is equally readable (in comparison to `a->b->c->d`). That's why your answer doesn't answer the OP's question. – Shahbaz Jun 04 '13 at 09:11
  • @Shahbaz hmm, yes, probably the only relevant part of my answer is the last one, but it's still a little bit arbitrary. I understood too late the question. – effeffe Jun 04 '13 at 11:38
  • 6
    @Shahbaz That may be the case for a java programmer, a C/C++ programmer will understand `a.b.c.d` and `a->b->c->d` as two *very* different things: The first is a single memory access to a nested sub-object (there is only a single memory object in this case), the second is three memory accesses, chasing pointers through four likely distinct objects. That's a huge difference in memory layout, and I believe that C is right in distinguishing these two cases very visibly. – cmaster - reinstate monica Oct 17 '17 at 08:45
  • 1
    @cmaster, as much as I am with you on insulting java programmers, your argument is not a very good one. For example, in the case of `a + b`, there is a huge difference in performance if `a` and `b` are `int`s or `float`s, especially where there is no FPU. Should C make a syntactic distinction between `int` plus and `float` plus? How about disallowing integer promotion because it's a hidden `mov`? The point here is that `a.b` can do the job of both `a.b` and `a->b` based on whether `a` is a pointer or not. There is no ambiguity here. – Shahbaz Oct 17 '17 at 14:56
  • 1
    @cmaster, and it's not like you are gaining anything from making a distinction. If you write `a.b` and the compiler gives you an error saying `a` is a pointer, do you suddenly change your mind because `a->b` is more expensive and restructure your code, perhaps writing a macro to avoid passing a pointer to a function or pass the struct by value? Or do you just change `a.b` to `a->b` and compile again? – Shahbaz Oct 17 '17 at 14:58
  • 2
    @Shahbaz I didn't mean that as an insult of the java programmers, they are simply used to a language with fully implicit pointers. Had I been brought up as a java programmer, I'd likely think the same way... Anyway, I actually think that the operator overloading that we see in C is less than optimal. However, I acknowledge that we all have been spoiled by the mathematicians who liberally overload their operators for pretty much everything. I also understand their motivation, as the set of available symbols is rather limited. I guess, in the end it's just the question where you draw the line... – cmaster - reinstate monica Oct 17 '17 at 15:33
  • 2
    @Shahbaz you gain a bit of safety, when you dereference pointers you need to make sure that you aren't dereferencing nullpointers. a.b.c.d is guaranteed to succeed as long as a is fully formed (initialized). a->b->c->d will segfault if any of a,b,c are nulls. Alternatively a->b.c->d will tell you where you are doing indirect memory accesses. – Tomas Pruzina Sep 27 '18 at 13:30
  • `(*(*(*a).b).c).d` is totally silly. That's why everyone who avoids `->` uses `a[0].b[0].c[0].d`. So I'm told. :-D (Hmm, they could use `0[a].0[b].0[c].d` too as an alternative.) – Eljay Oct 15 '19 at 02:36
  • @Eljay They both look silly to me since the array syntax is misleading if there are no arrays, and I don't see why anyone should avoid the arrow operator. Also, the `0[a]` notation, beyond being ridiculous, does not work because of operator precedence. – effeffe Oct 15 '19 at 09:43
  • The Rust programming language allows you to do `a.b.c.d.e` on a nested struct, even without the need of the dereferencing operator `*`. – Tianyi Shi Oct 06 '20 at 08:29
20

C also does a good job at not making anything ambiguous.

Sure the dot could be overloaded to mean both things, but the arrow makes sure that the programmer knows that he's operating on a pointer, just like when the compiler won't let you mix two incompatible types.

mukunda
  • 2,908
  • 15
  • 21
  • 4
    This is the simple and correct answer. C mostly tries to avoid overloading which IMO is one of the best things about C. – jforberg Sep 14 '15 at 11:51
  • 22
    Lots of things in C are ambiguous and fuzzy. There are implicit type conversions, math operators are overloaded, chained indexing does something completely different depending on whether you're indexing a multidimensional array or an array of pointers, and anything could be a macro hiding anything (the uppercase naming convention helps there, but C doesn't enforce it). – Petr Skocik Jun 13 '18 at 11:28
  • 1
    With that reasoning, why have the arrow at all? It makes extra sure the programmer knows that they're operating on a pointer if they must do `(*a).b` to access the struct contents. – CivFan Sep 09 '20 at 20:37
  • @CivFan Saying `(*a).b` wouldn't have the same meaning, since we don't need/want to dereference `a` when we want to fetch the value of `b`. – Khoa Vo Jul 12 '21 at 13:40
0

In C there's no technical reason to have a separate -> operator. But it does add clarity: if you see a ->, you know that the left operand is a pointer and can potentially be null, so you might need to check for null before dereferencing it.

In C++, there are classes that pretend to be pointers to some degree (std::unique_ptr, std::shared_ptr, std::optional). They support * and -> like pointers, but they also have their own member functions, accessible with the . operator. Separating the notation this way avoids any possible member name conflicts, and also adds clarity.

HolyBlackCat
  • 78,603
  • 9
  • 131
  • 207
  • I copied my answer from [here](https://stackoverflow.com/a/76099904/2752075), since that question is closed as a duplicate of this one. Such copying [seems to be encouraged](https://meta.stackexchange.com/q/92934/353058). – HolyBlackCat Apr 25 '23 at 12:32