Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?
From the compiler's point of view only side effects matter. I think both cases should be optimized to the same instructions by any sane compiler with decent optimization. Let's test it out:
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
void function2(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts;
dist = pow(p->X[i],2.) + pow(p->Y[i],2.) + pow(p->Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
Both functions compile to the same assembly instructions, as shown at godbolt. Throughout this post I am using gcc 8.2 targeting x86_64.
Is having variables allocated both in heap and stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in the memory because of this mix?
The real answer is: it depends on the architecture. On x86_64 I believe there will be no measurable difference in accessing (not allocating) array members between:
System sys_instance;
System *sys = &sys_instance;
double Xses[NPMAX];
sys->parts.X = Xses;
double Yses[NPMAX];
sys->parts.Y = Yses;
double Zses[NPMAX];
sys->parts.Z = Zses;
and:
System *sys = alloca(sizeof(*sys));
sys->parts.X = alloca(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = alloca(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = alloca(sizeof(*sys->parts.Z) * NPMAX);
and:
System *sys = malloc(sizeof(*sys));
sys->parts.X = malloc(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = malloc(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = malloc(sizeof(*sys->parts.Z) * NPMAX);
or any mix of these forms. Whether you use malloc or alloca, both result in a pointer, which from the accessing point of view is the same. But keep CPU cache and other architecture-dependent effects in mind. Using malloc will make the allocation itself significantly slower.
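One way to keep the malloc cost down is to pay for a single allocation instead of three and slice it into the X, Y and Z arrays. This is a sketch under assumptions: the question doesn't show the real System definition, so the struct layouts and the init_system helper below are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

#define NPMAX 1024

/* Hypothetical layout, assumed from the question. */
struct Particles { double *X, *Y, *Z; };
struct Props     { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particles parts; } System;

/* One malloc instead of three: the access pattern afterwards is
   identical, but there is only one allocation (and one free) to pay
   for, and the three arrays end up contiguous in memory. */
int init_system(System *sys) {
    double *buf = malloc(3 * NPMAX * sizeof *buf);
    if (!buf) return -1;
    sys->parts.X = buf;
    sys->parts.Y = buf + NPMAX;
    sys->parts.Z = buf + 2 * NPMAX;
    return 0;
}
```

Freeing then takes a single free(sys->parts.X), since X points at the start of the whole buffer.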
Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?
Even if you do:
static System sys_static;
System *sys = &sys_static;
static double X_static[NPMAX];
sys->parts.X = X_static;
static double Y_static[NPMAX];
sys->parts.Y = Y_static;
static double Z_static[NPMAX];
sys->parts.Z = Z_static;
a pointer to sys is still passed to your function, and all accesses are the same.
In some rare cases - when not using malloc, when the initialization of sys has no side effects, when your function is declared static, and with a good optimizer - the whole call could be optimized out and sys->props.minDist precalculated at compile time. But I wouldn't aim for that, unless you want to use C++ with consteval or constexpr.
If the number of X, Y and Z elements is the same, I would go with what @WhozCraig suggested:
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts[i];
dist = pow(p->X, 2.) + pow(p->Y, 2.) + pow(p->Z, 2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
This saves the cycles needed for index multiplication. It also reduces the number of mallocs needed to allocate (and resize) the elements. The sys->parts[i] part may be computed once for the whole dist= line. In the case of sys->parts.X[i], sys->parts may be computed once, but then for each of X, Y and Z the value pointer + sizeof(elem) * i must be computed. With a decent compiler and optimizer it makes no difference, though: this approach resulted in different assembly, but the same number of instructions, see godbolt.
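The array-of-structs layout assumed above could be fleshed out as follows. This is a sketch: the question doesn't show the real System definition, so the struct Particle, struct Props and System types here are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical array-of-structs layout: one struct per particle,
   so &sys->parts[i] is a single address computation per iteration. */
struct Particle { double X, Y, Z; };
struct Props    { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particle *parts; } System;

void function_aos(System *sys) {
    for (size_t i = 0; i < sys->props.Npart; i++) {
        const struct Particle *const p = &sys->parts[i];
        /* Plain multiplication instead of pow(..., 2.). */
        double dist = p->X * p->X + p->Y * p->Y + p->Z * p->Z;
        if (dist < sys->props.minDist) sys->props.minDist = dist;
    }
}
```

With this layout all three coordinates of a particle sit in one cache line, which tends to help for this access pattern.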
I would definitely declare all variables that represent the size of an object with size_t type: both the loop counter i and sys->props.Npart. They represent an element count, and that is exactly what size_t is for.
But I would hand-optimize the loop. You are accessing sys->props.Npart in every loop check. If staying with pointers, I would declare double *X, *Y, *Z as restrict to each other - I suppose you don't expect them to alias. You are also accessing sys->props.minDist in every iteration and conditionally assigning to it. You only need to dereference sys twice - once at the beginning and once at the end (unless you have some parallel code that depends on the minDist value mid-calculation, which I hope you don't, since your current code has no means of synchronization). Use a local variable and access sys as few times as possible.
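For the struct-of-arrays layout, the restrict suggestion could look like the sketch below. The struct definitions are assumptions (the question doesn't show them); the point is hoisting the three arrays into restrict-qualified locals and keeping minDist in a local.

```c
#include <stddef.h>

/* Hypothetical struct-of-arrays layout, assumed from the question. */
struct Particles { double *X, *Y, *Z; };
struct Props     { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particles parts; } System;

void function_restrict(System *sys) {
    /* restrict promises the compiler that X, Y and Z never alias,
       so loads can be hoisted and reordered freely. */
    const double *restrict X = sys->parts.X;
    const double *restrict Y = sys->parts.Y;
    const double *restrict Z = sys->parts.Z;
    const size_t n = sys->props.Npart;
    double minDist = sys->props.minDist;   /* sys dereferenced once here... */
    for (size_t i = 0; i < n; i++) {
        double dist = X[i] * X[i] + Y[i] * Y[i] + Z[i] * Z[i];
        if (dist < minDist) minDist = dist;
    }
    sys->props.minDist = minDist;          /* ...and once here. */
}
```

Breaking the restrict promise (e.g. passing the same array as both X and Y) would be undefined behavior, so this is only valid if the arrays are genuinely distinct.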
I would replace the pow calls with variable assignments (so each value is dereferenced only once) and plain multiplication. Compilers must assume a dereferenced value may change mid-loop if there are any assignments - let's protect against that. A good optimizer will, however, optimize the pow(..., 2.) calls away.
If performance is so much needed, I would go with:
void function3(System * restrict sys){
double minDist = sys->props.minDist;
for (const struct Particles
* const start = &sys->parts[0],
* const stop = &sys->parts[sys->props.Npart],
* p = start; p < stop; ++p) {
const double X = p->X;
const double Y = p->Y;
const double Z = p->Z;
const double dist = X * X + Y * Y + Z * Z;
if (dist < minDist) {
minDist = dist;
}
}
sys->props.minDist = minDist;
return;
}
This results in slightly less assembly code, mostly because sys->props.minDist is not accessed on every iteration and there is no temporary counter to maintain and increment. Use const to hint to the compiler that you won't modify a variable.