
My SSE code is just as slow as the standard C version. What am I doing wrong?

I'm running on an Intel i3-6100 CPU, using C with MinGW and CLion, and compiling with the -O0 flag.

I'm measuring the performance using the clock() function; both versions are equally fast to within about 45 ticks out of over 1000 (SSE: 1138 ticks, C: 1093 ticks). I thought that SSE somehow messes up the clock() time measurement, but even by simply counting seconds there is no difference.

The function (swapping the comments switches between the SSE and scalar versions):

void vTrace(struct Ray * ray, float t, struct Vec3f * r){
    //__m128 * mr = (__m128 *)r;
    //__m128 mt_m = _mm_set1_ps(t);
    //*mr = _mm_add_ps(*(__m128*)&ray->o, _mm_mul_ps(*(__m128*)&ray->d, mt_m));
    r->x = ray->o.x + ray->d.x*t;
    r->y = ray->o.y + ray->d.y*t;
    r->z = ray->o.z + ray->d.z*t;

}
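For reference, here is a sketch of how the commented-out SSE version could be written without the struct-pointer casts that the comments flag as strict-aliasing violations. `vTraceSSE` is a made-up name, and it assumes the same struct layout as shown below; the unaligned load/store intrinsics are used because `Vec3f` carries no alignment attribute:

```c
#include <xmmintrin.h>

/* Same layout as in the question. */
struct Vec3f { float x, y, z, w; };
struct Ray { struct Vec3f o, d, inverse_d; };

/* Hypothetical aliasing-safe rewrite: go through _mm_loadu_ps /
 * _mm_storeu_ps instead of casting struct pointers to __m128*.
 * With _Alignas(16) on the struct, the aligned _mm_load_ps /
 * _mm_store_ps variants would work as well. */
void vTraceSSE(struct Ray *ray, float t, struct Vec3f *r) {
    __m128 o  = _mm_loadu_ps(&ray->o.x);   /* loads o.x .. o.w  */
    __m128 d  = _mm_loadu_ps(&ray->d.x);   /* loads d.x .. d.w  */
    __m128 mt = _mm_set1_ps(t);            /* broadcast t to all 4 lanes */
    _mm_storeu_ps(&r->x, _mm_add_ps(o, _mm_mul_ps(d, mt)));
}
```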

The benchmarking code:

float benchmark_t = 1;
struct Ray benchmark_ray;
vInit3f(&benchmark_ray.o, 0.2, 0.23, 1.4);
vInit3f(&benchmark_ray.d, 0.2, 0.23, 1.4);
ticks = clock();
i = 0;
while(i < 1000000000 ){
    vTrace(&benchmark_ray, benchmark_t, &benchmark_ray.o);
    i ++;
}
printf("TIME : %ld ticks\n", (long)(clock() - ticks));
printVec("result", benchmark_ray.o);

The structures:

struct Vec3f{
    float x;
    float y;
    float z;
    float w;//just for SSE
};

struct Ray{
    struct Vec3f o;
    struct Vec3f d;
    struct Vec3f inverse_d;
};

Using SSE, the performance should be roughly 4 times better. Why is there no performance gain?

noName
  • `_mm_add_ps(*(__m128*)&ray->o, _mm_mul_ps(*(__m128*)&ray->d, mt_m));` has some major strict aliasing issues. – S.S. Anne Aug 01 '19 at 15:51
  • How would I fix that? – noName Aug 01 '19 at 15:56
  • memcpy would be the naive solution, but you should probably use SSE types instead of structures. – S.S. Anne Aug 01 '19 at 15:58
  • Have you looked at the assembly the two versions produce? – Shawn Aug 01 '19 at 16:02
  • I disassembled the C code and, for whatever reason, it got auto-vectorized if I understand correctly... or at least it's using the xmm0/xmm1 registers. – noName Aug 01 '19 at 16:19
  • memcpy added a 20% performance overhead... in my case – noName Aug 01 '19 at 16:26
  • 1
    Did you seriously compile with `-O0`? This essentially means "disable optimization". Try `-O2` or `-O3` (or maybe `-Os`) instead. – chtz Aug 01 '19 at 20:05
  • @JL2210 `__m128*` is like `char*` for aliasing rules: it's allowed to alias anything. GCC defines it with `__attribute__((vector_size(16), may_alias))` – Peter Cordes Aug 01 '19 at 21:05
  • @noName: scalar math also uses SSE instructions like `addss` (scalar single) with XMM regs. Seeing XMM regs tells you nothing about auto-vectorization. Your scalar code leaves `.w` unmodified so GCC can't easily auto-vectorize. And certainly won't with `-O0`, that's your problem. – Peter Cordes Aug 01 '19 at 21:06

1 Answer


The code somehow got auto-vectorized; I don't know why, but it did, so there was no great performance difference. (Next time, step through the assembly code first.)
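Following up on the comments above: at `-O0` GCC emits un-optimized scalar code, so neither version runs fast, and the scalar version leaving `.w` untouched makes auto-vectorization harder. As a hedged sketch (`vTraceScalar4` is a made-up name), a scalar variant that updates all four floats gives the auto-vectorizer a clean 4-lane pattern when built with `-O2` or `-O3`:

```c
/* Same layout as in the question. */
struct Vec3f { float x, y, z, w; };
struct Ray { struct Vec3f o, d, inverse_d; };

/* Scalar version that also writes the padding lane .w, so the whole
 * struct can be treated as one 4-float vector by the auto-vectorizer
 * (build with -O2 or -O3, not -O0). */
void vTraceScalar4(struct Ray *ray, float t, struct Vec3f *r) {
    r->x = ray->o.x + ray->d.x * t;
    r->y = ray->o.y + ray->d.y * t;
    r->z = ray->o.z + ray->d.z * t;
    r->w = ray->o.w + ray->d.w * t; /* dead lane, kept for vectorization */
}
```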

noName