Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?
From the compiler's point of view only side effects matter. I think both cases should be optimized to the same instructions by any sane compiler with decent optimization. Let's test it out:
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
void function2(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts;
dist = pow(p->X[i],2.) + pow(p->Y[i],2.) + pow(p->Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
Both functions compile to the same assembly instructions, as shown at godbolt. Throughout this post I am using gcc 8.2 targeting x86_64.
Is having variables allocated both in heap and stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in the memory because of this mix?
The real answer is: it depends on the architecture. On x86_64 I believe there will be no measurable difference in accessing (not allocating) array members between:
System sys_instance;
System *sys = &sys_instance;
double Xses[NPMAX];
sys->parts.X = Xses;
double Yses[NPMAX];
sys->parts.Y = Yses;
double Zses[NPMAX];
sys->parts.Z = Zses;
and:
System *sys = alloca(sizeof(*sys));
sys->parts.X = alloca(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = alloca(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = alloca(sizeof(*sys->parts.Z) * NPMAX);
and:
System *sys = malloc(sizeof(*sys));
sys->parts.X = malloc(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = malloc(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = malloc(sizeof(*sys->parts.Z) * NPMAX);
or any mix of these forms. Whether you use malloc or alloca, both result in a pointer, which from the accessing point of view is the same. But keep CPU cache and other architecture-dependent effects in mind. Using malloc will make the allocation itself significantly slower.
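One way to keep the malloc cost down is to pay for a single allocation instead of three and slice it into the X, Y and Z arrays. This is a sketch under assumptions: the question doesn't show the real System definition, so the struct layouts and the init_system helper below are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

#define NPMAX 1024

/* Hypothetical layout, assumed from the question. */
struct Particles { double *X, *Y, *Z; };
struct Props     { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particles parts; } System;

/* One malloc instead of three: the access pattern afterwards is
   identical, but there is only one allocation (and one free) to pay
   for, and the three arrays end up contiguous in memory. */
int init_system(System *sys) {
    double *buf = malloc(3 * NPMAX * sizeof *buf);
    if (!buf) return -1;
    sys->parts.X = buf;
    sys->parts.Y = buf + NPMAX;
    sys->parts.Z = buf + 2 * NPMAX;
    return 0;
}
```

Freeing then takes a single free(sys->parts.X), since X points at the start of the whole buffer.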
Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?
Even if you do:
static System sys_static;
System *sys = &sys_static;
static double X_static[NPMAX];
sys->parts.X = X_static;
static double Y_static[NPMAX];
sys->parts.Y = Y_static;
static double Z_static[NPMAX];
sys->parts.Z = Z_static;
a pointer to sys is still passed to your function, and all accesses are the same.
In some rare cases - when not using malloc, when the initialization of sys has no side effects, when your function is declared static, and with a good optimizer - the whole call could be optimized out and sys->props.minDist precalculated at compile time. But I wouldn't aim for that, unless you want to use C++ with consteval or constexpr.
If the number of X, Y and Z elements is the same, I would go with what @WhozCraig suggested:
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts[i];
dist = pow(p->X, 2.) + pow(p->Y, 2.) + pow(p->Z, 2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
This saves the cycles needed for index multiplication. It also reduces the number of mallocs needed to allocate (and resize) the elements. The sys->parts[i] part may be computed once for the whole dist= line. In the case of sys->parts.X[i], sys->parts may be computed once, but then for each of X, Y and Z the value pointer + sizeof(elem) * i must be computed. With a decent compiler and optimizer it makes no difference, though: this approach resulted in different assembly, but the same number of instructions, see godbolt.
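The array-of-structs layout assumed above could be fleshed out as follows. This is a sketch: the question doesn't show the real System definition, so the struct Particle, struct Props and System types here are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical array-of-structs layout: one struct per particle,
   so &sys->parts[i] is a single address computation per iteration. */
struct Particle { double X, Y, Z; };
struct Props    { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particle *parts; } System;

void function_aos(System *sys) {
    for (size_t i = 0; i < sys->props.Npart; i++) {
        const struct Particle *const p = &sys->parts[i];
        /* Plain multiplication instead of pow(..., 2.). */
        double dist = p->X * p->X + p->Y * p->Y + p->Z * p->Z;
        if (dist < sys->props.minDist) sys->props.minDist = dist;
    }
}
```

With this layout all three coordinates of a particle sit in one cache line, which tends to help for this access pattern.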
I would definitely declare all variables that represent the size of an object with size_t type: both the loop counter i and sys->props.Npart. They represent an element count, and that is exactly what size_t is for.
But I would hand-optimize the loop. You are accessing sys->props.Npart in every loop check. If staying with pointers, I would declare double *X, *Y, *Z as restrict to each other - I suppose you don't expect them to alias. You are also accessing sys->props.minDist in every iteration and conditionally assigning to it. You only need to dereference sys twice - once at the beginning and once at the end (unless you have some parallel code that depends on the minDist value mid-calculation, which I hope you don't, since your current code has no means of synchronization). Use a local variable and access sys as few times as possible.
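For the struct-of-arrays layout, the restrict suggestion could look like the sketch below. The struct definitions are assumptions (the question doesn't show them); the point is hoisting the three arrays into restrict-qualified locals and keeping minDist in a local.

```c
#include <stddef.h>

/* Hypothetical struct-of-arrays layout, assumed from the question. */
struct Particles { double *X, *Y, *Z; };
struct Props     { size_t Npart; double minDist; };
typedef struct { struct Props props; struct Particles parts; } System;

void function_restrict(System *sys) {
    /* restrict promises the compiler that X, Y and Z never alias,
       so loads can be hoisted and reordered freely. */
    const double *restrict X = sys->parts.X;
    const double *restrict Y = sys->parts.Y;
    const double *restrict Z = sys->parts.Z;
    const size_t n = sys->props.Npart;
    double minDist = sys->props.minDist;   /* sys dereferenced once here... */
    for (size_t i = 0; i < n; i++) {
        double dist = X[i] * X[i] + Y[i] * Y[i] + Z[i] * Z[i];
        if (dist < minDist) minDist = dist;
    }
    sys->props.minDist = minDist;          /* ...and once here. */
}
```

Breaking the restrict promise (e.g. passing the same array as both X and Y) would be undefined behavior, so this is only valid if the arrays are genuinely distinct.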
I would replace the pow calls with variable assignments (so each value is dereferenced only once) and plain multiplication. Compilers must assume a dereferenced value may change mid-loop if there are any assignments - let's protect against that. A good optimizer will, however, optimize the pow(..., 2.) calls away.
If performance is so much needed, I would go with:
void function3(System * restrict sys){
double minDist = sys->props.minDist;
for (const struct Particles
* const start = &sys->parts[0],
* const stop = &sys->parts[sys->props.Npart],
* p = start; p < stop; ++p) {
const double X = p->X;
const double Y = p->Y;
const double Z = p->Z;
const double dist = X * X + Y * Y + Z * Z;
if (dist < minDist) {
minDist = dist;
}
}
sys->props.minDist = minDist;
return;
}
This results in slightly less assembly code, mostly because sys->props.minDist is not accessed on every iteration and there is no temporary counter to maintain and increment. Use const to hint to the compiler that you won't modify a variable.