
Suppose we have some structured data that is encapsulated in another structure so that a circular linked list can be formed.

typedef struct Data 
{
    int x;
    int y;
} Data;

typedef struct DataNode 
{
    struct DataNode *next;
    struct Data *data;
} DataNode;

Assuming that the circular linked list is constructed correctly and *head points to a member of the list, are there any drawbacks (performance or otherwise) to using the -> operator in a chain, especially in a loop?

DataNode * findPrevMatching(int x, int y)
{
    // Chained arrow operators in a loop
    while (!(head->next->data->x == x && head->next->data->y == y))  
        head = head->next;

    return head;
}

Would there be any difference if I created local variables so that there are no chained arrows?

DataNode * findPrevMatching(int x, int y)
{   
    DataNode *next = head->next;
    Data *data = next->data;

    while (!(data->x == x && data->y == y))
    {
        // Assign head->next to head
        head = head->next;

        // Assign each local variable, using the new head
        next = head->next;
        data = next->data;
    }

    return head;
}
hololeap
  • It is not clear what your problem is. Unless you dereference an invalid pointer like a _null-pointer_, what do you think can happen? Did you read what the arrow-operator does? What did you not understand? – too honest for this site Sep 19 '16 at 02:47
  • There are no drawbacks as long as you are sure that the intermediate pointers are valid. – R Sahu Sep 19 '16 at 02:48
  • @Olaf OP specifically asked about performance. – Kyle Strand Sep 19 '16 at 03:05
  • I think the core question here is, does the use of the chained operators prevent good optimizer behavior? This seems like a good question and one for which I certainly don't know the answer. – Kyle Strand Sep 19 '16 at 03:08
  • If you have the option to put all your nodes in a pool or preallocated array, locality of reference means the data you want is more likely to be in the cache when you look it up. Keeping indices rather than pointers around would require pointer arithmetic, which might or might not be slower on a given CPU. – Davislor Sep 19 '16 at 03:12
  • Optimizers are pretty good; they'd probably manage to achieve what you write themselves. I think your `thisX` and `thisY` are not useful; I'm not wholly convinced by `data`, but you could look at the assembly output, or simply the size of the code. I note that you don't test for the end of the list, so if the entry you're looking for isn't in the list, your code will go haywire (probably it will crash trying to dereference a null pointer). – Jonathan Leffler Sep 19 '16 at 03:17
  • @JonathanLeffler The `thisX` and `thisY` variables were kind of overkill, but I wanted to get the point across. Also note that since this is a _circular_ linked list, `head->next` should always point to a valid node assuming that the list is constructed correctly and has at least one member. – hololeap Sep 19 '16 at 03:21
  • OK; then you (at least potentially) have a different problem — if the entry being sought isn't in the list, your search is going to continue for a very long time, is it not? – Jonathan Leffler Sep 19 '16 at 03:23
  • @JonathanLeffler Good point. Let's just assume that the entry is in the list ;) – hololeap Sep 19 '16 at 03:25
  • The first half of my original comment still applies — optimizers are good, and you've probably gone too far with `thisX` and `thisY`. For the others, `next` is necessary, and `data` is in the borderland; I reserve judgement on whether it is beneficial, but tend to think it is of only marginal value. In the old days (30 years ago), it might even have been detrimental; optimizers have improved since then. – Jonathan Leffler Sep 19 '16 at 03:28
  • @JonathanLeffler So then do you think that `while (!(next->data->x == x && next->data->y == y))` would be superior to the first example? After a number of chains, does performance drop off (as well as readability)? (Note that I am pretty new to C and don't understand the ins and outs of the language or the compiler). – hololeap Sep 19 '16 at 03:31
  • Yes, but I've not conducted measurements to justify that guess/thought. Only measurements really count — and you can do them just as much as I can. – Jonathan Leffler Sep 19 '16 at 03:39
  • run the code under `gcc code.c -O3 -o code.obj -c` then `objdump -D code.obj`. I think this kind of pattern has already been generalized by the compiler and the variables will get optimized out anyway. The performance loss will mostly matter due to cache hits/misses due to malloc allocating from discontiguous memory sections, and having to look up uncached memory locations (~20x longer than cached). If there is a difference in your code, it should be near nonexistent if not nonexistent. That said, without optimizations, chaining arrows should be faster (fewer operations to do the same thing). – Dmytro Sep 19 '16 at 04:04
  • Your second example where you add some additional variables for _clarity_ is better. A common mistake is to assume that the more "concise" looking version [the first one] will generate better code. In reality, your _second_ version generates _faster_ code. The first has 11 asm insts per loop. The second has only 8. – Craig Estey Sep 19 '16 at 04:20
  • Consider having `dataNode` store the actual data, instead of a pointer to it. (Pros and cons to this) – M.M Sep 19 '16 at 04:45
  • I made a few edits to the code based on suggestions from comments. @Craig Estey - I made this example have four layers of reference (three -> operators) because, frankly, I couldn't think of a better way to communicate my question. Normally, I wouldn't go past three layers (two -> operators). I agree that too many of these chained together makes it very hard to read. I am familiar with this concept in Ruby as well, where too many chained-together methods can be a real nuisance. – hololeap Sep 19 '16 at 06:33

4 Answers


In one of my comments, I noted that I'd not done any measuring, and only measuring is likely to show anything useful. I've since created one header and 4 variant implementations of the code, as shown:

node.h

typedef struct Data 
{
    int x;
    int y;
} Data;

typedef struct DataNode 
{
    struct DataNode *next;
    struct Data *data;
} DataNode;

extern DataNode *head;

extern DataNode *findPrevMatching(int x, int y);

node1.c

#include "node.h"

DataNode * findPrevMatching(int x, int y)
{
    // Chained arrow operators in a loop
    while (!(head->next->data->x == x && head->next->data->y == y))  
        head = head->next;

    return head;
}

node2.c

#include "node.h"


DataNode * findPrevMatching(int x, int y)
{   
    DataNode *next = head->next;
    Data *data = next->data;
    int thisX = data->x;
    int thisY = data->y;

    while (!(thisX == x && thisY == y))
    {
        // Assign head->next to head
        head = head->next;

        // Assign each local variable, using the new head
        next = head->next;
        data = next->data;
        thisX = data->x;
        thisY = data->y;
    }

    return head;
}

node3.c

#include "node.h"


DataNode * findPrevMatching(int x, int y)
{   
    DataNode *next = head->next;
    Data *data = next->data;

    while (!(data->x == x && data->y == y))
    {
        head = head->next;
        next = head->next;
        data = next->data;
    }

    return head;
}

node4.c

#include "node.h"


DataNode * findPrevMatching(int x, int y)
{   
    DataNode *next = head->next;

    while (!(next->data->x == x && next->data->y == y))
    {
        head = head->next;
        next = head->next;
    }

    return head;
}

Code sizes under different optimization regimes

The code was compiled repeatedly using these command lines, with different optimization levels in place of -O0:

gcc -O0 -g -std=c11 -Wall -Wextra -Werror -c node1.c
gcc -O0 -g -std=c11 -Wall -Wextra -Werror -c node2.c
gcc -O0 -g -std=c11 -Wall -Wextra -Werror -c node3.c
gcc -O0 -g -std=c11 -Wall -Wextra -Werror -c node4.c

GCC 6.2.0 on Mac OS X 10.11.6

The sizes are those reported by the size command on Mac OS X 10.11.6 with GCC 6.2.0.

OFLAGS=-O0
__TEXT  __DATA  __OBJC  others  dec     hex
176     0       0       1007    1183    49f     node1.o
239     0       0       1226    1465    5b9     node2.o
208     0       0       1146    1354    54a     node3.o
192     0       0       1061    1253    4e5     node4.o

OFLAGS=-O1
__TEXT  __DATA  __OBJC  others  dec     hex
95      0       0       872     967     3c7     node1.o
125     0       0       1335    1460    5b4     node2.o
118     0       0       1182    1300    514     node3.o
114     0       0       993     1107    453     node4.o

OFLAGS=-O2
__TEXT  __DATA  __OBJC  others  dec     hex
126     0       0       848     974     3ce     node1.o
111     0       0       1410    1521    5f1     node2.o
121     0       0       1135    1256    4e8     node3.o
121     0       0       1005    1126    466     node4.o

OFLAGS=-O3
__TEXT  __DATA  __OBJC  others  dec     hex
126     0       0       848     974     3ce     node1.o
111     0       0       1410    1521    5f1     node2.o
128     0       0       1111    1239    4d7     node3.o
126     0       0       937     1063    427     node4.o

OFLAGS=-Os
__TEXT  __DATA  __OBJC  others  dec     hex
101     0       0       848     949     3b5     node1.o
135     0       0       1293    1428    594     node2.o
112     0       0       1133    1245    4dd     node3.o
107     0       0       1003    1110    456     node4.o

Clearly, compiling without optimization (-O0) produces much larger code than compiling with any optimization. For this code, the smallest object overall is node1.o built at -O1. At -O2 and -O3, node2.o is the smallest; at -O1 and -Os, node1.o is the smallest.

Clang from Xcode 8.0

The clang from Xcode version 8.0 reports:

Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin15.6.0

And the sizes reported are:

OFLAGS=-O0
__TEXT  __DATA  __OBJC  others  dec     hex
189     0       0       925     1114    45a     node1.o
245     0       0       1105    1350    546     node2.o
214     0       0       1023    1237    4d5     node3.o
198     0       0       970     1168    490     node4.o

OFLAGS=-O1
__TEXT  __DATA  __OBJC  others  dec     hex
119     0       0       977     1096    448     node1.o
119     0       0       1071    1190    4a6     node2.o
119     0       0       1033    1152    480     node3.o
119     0       0       999     1118    45e     node4.o

OFLAGS=-O2
__TEXT  __DATA  __OBJC  others  dec     hex
104     0       0       973     1077    435     node1.o
104     0       0       1069    1173    495     node2.o
104     0       0       1031    1135    46f     node3.o
104     0       0       997     1101    44d     node4.o

OFLAGS=-O3
__TEXT  __DATA  __OBJC  others  dec     hex
104     0       0       973     1077    435     node1.o
104     0       0       1069    1173    495     node2.o
104     0       0       1031    1135    46f     node3.o
104     0       0       997     1101    44d     node4.o

OFLAGS=-Os
__TEXT  __DATA  __OBJC  others  dec     hex
104     0       0       973     1077    435     node1.o
104     0       0       1069    1173    495     node2.o
104     0       0       1031    1135    46f     node3.o
104     0       0       997     1101    44d     node4.o

Conclusion

There is absolutely no substitute for experimentation! You can look at the assembler and decide which code you think is best.

Jonathan Leffler
  • Based on your data, it looks like the most chained-together form has the smallest size. Would this also mean it has fewer operations? I also find it _very_ interesting that the 4th example only just lags behind the first in size, and that the smallest size is from the first example using no optimization! Is this because the compiler is foregoing a small amount of stack space for better speed? – hololeap Sep 19 '16 at 06:49
  • I was surprised that the GCC code didn't settle on a 'fixed size' like the Clang code did. I've not spent a long time analyzing it all. I didn't check, for example, that the assembler for the `-Os` variants with Clang actually is the same code in each case. Tentatively, the most chained-together form has the smallest size under `-O1` (and a larger smallest size under `-Os`, which is 'optimize for size' — intriguing!). But smallest is not necessarily fastest. (I didn't test `-Ofast` — see the [manual](https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Optimize-Options.html#Optimize-Options)). – Jonathan Leffler Sep 19 '16 at 06:59

If you can create an array of nodes, make your previous and next pointers indices within the array, and allocate all your nodes from that pool, that might have performance advantages on a given architecture. A contiguous array is more likely to be in the cache; you can tell the OS when you will and won't be using any node within that block with a function such as posix_madvise() or PrefetchVirtualMemory(); you might be able to use indices smaller than your pointers and thus get smaller nodes; and your CPU might support indirect addressing that makes looking up an array element as efficient as looking up a pointer.
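For illustration, here is a minimal sketch of that pooled, index-based layout; the pool size, the NodeIndex type, and the findPrevMatchingIdx name are assumptions for this example, not anything from the question:

typedef struct Data
{
    int x;
    int y;
} Data;

// Hypothetical pooled node: 'next' is an index into node_pool instead of
// a pointer, so nodes are smaller and stay contiguous in memory.
typedef unsigned short NodeIndex;   // assumes fewer than 65536 nodes

typedef struct PooledNode
{
    NodeIndex next;     // index of the next node in the circular list
    Data data;          // data stored inline for locality
} PooledNode;

#define POOL_SIZE 1024
static PooledNode node_pool[POOL_SIZE];

// Same search as findPrevMatching, but walking indices within the pool.
NodeIndex findPrevMatchingIdx(NodeIndex head, int x, int y)
{
    while (!(node_pool[node_pool[head].next].data.x == x &&
             node_pool[node_pool[head].next].data.y == y))
        head = node_pool[head].next;

    return head;
}

Whether the index arithmetic is actually cheaper than chasing pointers depends on the CPU, as noted above; measuring is the only way to know.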

The worst thing that can happen to correct code that dereferences several pointers to pointers in a row is a series of cache misses (or actually, page faults).

To really answer this question, though, you would want to profile, find out where the program is spending all its time, focus there, and profile again to find out how much time you saved.

Davislor

Using several field dereferencing operators (e.g. ptr->fld->otherfld->anotherfld) is certainly possible, provided all the involved pointers are valid (otherwise it is the dreaded undefined behavior; in practice you are likely to get a segmentation fault if you are lucky, but worse things could happen, see also this).

Performance-wise, there might be a few minor issues. First, the compiler might not always be able to optimize properly by keeping some intermediate values in registers (e.g. in ptr->fld->otherfield->anotherfield->bluefield = ptr->fld->otherfield->redfield + 2; the ptr->fld->otherfield is likely to be kept in a register and fetched once, but compilers make no guarantee in principle). Second, you might run into CPU cache issues and cache misses. Read the answers section of http://norvig.com/21-days.html (which is also worth reading in its entirety).
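A small, self-contained sketch of that hoisting; the struct and field names here are hypothetical, loosely modelled on the example above:

typedef struct Inner
{
    int redfield;
    int bluefield;
} Inner;

typedef struct Middle
{
    Inner *otherfield;
} Middle;

typedef struct Outer
{
    Middle *fld;
} Outer;

void update_chained(Outer *ptr)
{
    // Chained form: the compiler will usually load ptr->fld->otherfield
    // only once, but it is not obliged to when it cannot rule out aliasing.
    ptr->fld->otherfield->bluefield = ptr->fld->otherfield->redfield + 2;
}

void update_hoisted(Outer *ptr)
{
    // Manually hoisted form: the common prefix is read exactly once.
    Inner *inner = ptr->fld->otherfield;
    inner->bluefield = inner->redfield + 2;
}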

In practice, recent compilers are quite good at optimizing this if you ask them to (e.g. compile with gcc -O2 if using GCC), so you generally don't need to introduce additional local variables (except for readability, which alone is an excellent reason to introduce them). Also, learn more about the restrict keyword.
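And a tiny sketch of what restrict promises the compiler (the scale() function is hypothetical, just for illustration):

// With 'restrict' the programmer promises that 'dst' and 'src' do not
// alias, so the compiler may keep *src in a register across the loop
// instead of reloading it after every store through 'dst'.
void scale(int *restrict dst, const int *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = *src * 2;
}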

But don't micro-optimize your code early and by hand without benchmarking first.

If you use GCC, you could compile your code with gcc -O2 -S -fverbose-asm and look into the produced assembler code.

> Are there any drawbacks to using multiple arrow operators (->) chained together?

In practice, not many, provided you ask your compiler to optimize (and, of course, that all the involved pointers are valid). However, readability of the source code should always be a concern.

Basile Starynkevitch

> Normally, I wouldn't go past three layers (two -> operators). I agree that too many of these chained together makes it very hard to read. I am familiar with this concept in Ruby as well, where too many chained-together methods can be a real nuisance.

I never go beyond one level of -> [and I've been writing C for 35 years]. There is almost always a way to avoid the multiple levels.

In your example, because head is [AFAICT] a global, the code goes faster if head is copied into a function-scope variable for the duration (e.g. local = head at the start of the function and head = local at the end). Otherwise, aliasing considerations come into play: head must be fetched from and stored to memory on each iteration, because the compiler can't rule out that head is updated in some way it can't see or anticipate (more on this below).

Also, with the more split-out code, it's easier to add conditionally compiled debug statements and assert checks. And, perhaps more importantly, it's possible to add blow-by-blow comments that show intent [as you did]. The more complex the "equation", the harder it is to do this.

typedef struct Data {
    int x;
    int y;
} Data;

typedef struct DataNode {
    struct DataNode *next;
    struct Data *data;
} DataNode;

DataNode *head;

#ifdef DEBUG
#define dbgprt(_fmt...)     printf(_fmt)
#else
#define dbgprt(_fmt...)     /**/
#endif

DataNode *
findPrevMatching3(int x, int y)
{
    DataNode *hd = head;
    DataNode *next = hd->next;
    Data *data = next->data;
    int thisX;
    int thisY;

    while (1) {
        thisX = data->x;
        thisY = data->y;
        dbgprt("findPrevMatching3: thisX=%d thisY=%d\n",thisX,thisY);

        // Stop when the next node's data matches both x and y
        if (thisX == x && thisY == y)
            break;

        // Advance to the next node
        hd = hd->next;

        // Refresh the local variables using the new hd
        next = hd->next;
        data = next->data;
    }

    head = hd;

    return hd;
}

Without -DDEBUG, one might think that the above is less efficient [because of the separate if/break] than the original:

while (!(thisX == x && thisY == y))

but, once again, the optimizer will generate similar code (i.e. 7 instructions per loop). It produces a function that is 0x33 bytes in size [on x86 with -O2].

Here is a slightly streamlined version. To me, it is the simplest and easiest-to-understand version yet. It is also 7 instructions per loop, but the code size is reduced to 0x2B bytes. So, ironically, it's also the most compact and fastest version.

typedef struct Data {
    int x;
    int y;
} Data;

typedef struct DataNode {
    struct DataNode *next;
    struct Data *data;
} DataNode;

DataNode *head;

#ifdef DEBUG
#define dbgprt(_fmt...)     printf(_fmt)
#else
#define dbgprt(_fmt...)     /**/
#endif

DataNode *
findPrevMatching3(int x, int y)
{
    DataNode *hd = head;
    DataNode *next;
    Data *data;
    int thisX;
    int thisY;

    while (1) {
        next = hd->next;
        data = next->data;

        thisX = data->x;
        thisY = data->y;
        dbgprt("findPrevMatching3: thisX=%d thisY=%d\n",thisX,thisY);

        // Stop when the next node's data matches both x and y
        if (thisX == x && thisY == y)
            break;

        // Advance to the next node; next and data are refreshed
        // at the top of the loop
        hd = hd->next;
    }

    head = hd;

    return hd;
}

For more on what I mean by "aliasing considerations", see my answer: Is accessing statically or dynamically allocated memory faster?
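To make that concrete, here is a minimal sketch of the aliasing point, reusing the DataNode and head declarations from the listing above; the walk_global/walk_local names are just for this sketch, and whether the reloads actually happen depends on the compiler and on what it can see:

// Walking the global directly: the compiler may have to store 'head' back
// to memory and reload it on every iteration, because it cannot always
// rule out that some load in the loop aliases the global itself.
DataNode *walk_global(int x, int y)
{
    while (!(head->next->data->x == x && head->next->data->y == y))
        head = head->next;
    return head;
}

// Walking a local copy: 'hd' can stay in a register for the whole loop;
// the global is read once at entry and written once at exit.
DataNode *walk_local(int x, int y)
{
    DataNode *hd = head;

    while (!(hd->next->data->x == x && hd->next->data->y == y))
        hd = hd->next;

    head = hd;
    return hd;
}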


Historical note: In "The Elements of Programming Style", Brian Kernighan [co-author of "The C Programming Language"] and P.J. Plauger say: "Make it right before you make it faster."

Here, we've shown that when you make it right, you also make it faster.

Craig Estey