Using SSE effectively for rasterization

Question

I've been thinking about using the SSE instruction set to make my 3D software rasterizer faster, but I've never used SSE before and feel like I'm going about it completely wrong.

I'd like to hear from more experienced people whether the effort is worth it, and whether this code is poorly written:

typedef union _declspec(align(16)) {
    struct {
        float x;
        float y;
        float z;
        float w;
    };
    __m128 m128;
} Vec4_t;

Vec4_t AddVec(Vec4_t* a, Vec4_t *b) {
    __m128 value = _mm_add_ps(a->m128, b->m128);
    return *(Vec4_t*)&value;
}

This is how I'm testing it:

Vec4_t a = { 2.0f, 4.0f, 10.0f, 123.1f };
Vec4_t b = { 6.0f, 12.0f, 16.0f, 64.0f };
Vec4_t c = AddVec(&a, &b);

printf("%f, %f, %f, %f\n", c.x, c.y, c.z, c.w);

which outputs:

8.000000, 16.000000, 26.000000, 187.100006

I honestly have no idea what I'm doing. I'm surprised the code I wrote even worked.

If it's code review, there's a stack exchange site specifically for that. (Vec3_t is a funny name for a 4-element vector...) — Dietrich Epp, Feb 12 '17 at 00:12
@DietrichEpp I'm looking for more of an explanation on how to properly use SSE instructions in conjunction with my own structures. I'll check out the code review area too though. — , Feb 12 '17 at 00:16
You don't have to union them, you can just `_mm_load_ps` those floats later and `_mm_store_ps` them back. — harold, Feb 12 '17 at 00:44
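A rough sketch of what harold's suggestion could look like: a plain struct with no union, loaded into a register with `_mm_load_ps` and written back with `_mm_store_ps`. The names `Vec4` and `vec4_add` are made up for illustration, and the C11 `_Alignas(16)` specifier is an assumption (on MSVC, the `__declspec(align(16))` already used in the question would play the same role). The aligned load and store require 16-byte-aligned addresses; `_mm_loadu_ps` and `_mm_storeu_ps` are the unaligned alternatives.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Plain struct, no union. _Alignas(16) is an assumption (C11); it makes every
   Vec4 start on a 16-byte boundary, which _mm_load_ps/_mm_store_ps require. */
typedef struct {
    _Alignas(16) float x;
    float y;
    float z;
    float w;
} Vec4;

/* out = a + b, component-wise (hypothetical helper, not from the question). */
void vec4_add(Vec4 *out, const Vec4 *a, const Vec4 *b) {
    /* Guard against pointers that bypass the declared alignment. */
    assert(((uintptr_t)a & 15u) == 0);
    assert(((uintptr_t)b & 15u) == 0);
    assert(((uintptr_t)out & 15u) == 0);

    __m128 va = _mm_load_ps((const float *)a); /* loads x, y, z, w */
    __m128 vb = _mm_load_ps((const float *)b);
    _mm_store_ps((float *)out, _mm_add_ps(va, vb));
}
```

Called with the two test vectors from the question, this should produce the same sums as the union version, and the struct stays an ordinary aggregate that can be initialized and printed field by field.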
You can often get a huge speedup when using SSE/AVX extensions. But it depends on many factors, including the algorithm and memory access patterns. First, just get the scalar version working correctly. Then you have a baseline for correctness and performance. Then write a vector-based version that you can compare with the scalar. If you're processing streaming data, processing 4 bytes at a time can sometimes get you close to a 4X improvement over processing one byte at a time (oversimplified). — gavinb, Feb 12 '17 at 00:58
Did you really want three component vectors operations as well? You should look into struct of arrays (SoA) and array of structs (AoS) and best of all an array of struct of arrays (AoSoA). — Z boson, Feb 13 '17 at 08:21
The problem with an AoS as you have used is that at some point you will probably need to use horizontal operations e.g. with the dot product. With a SoA you can still use vertical operations for the dot product. Horizontal operations are usually slow with SIMD because they are not single micro-ops but rather micro-code of multiple micro-ops which are slow e.g. `dpps` is a good example of a slow multiple micro-op instruction. — Z boson, Feb 13 '17 at 08:26
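To illustrate the SoA point, here is a minimal sketch that computes four 3-component dot products per loop iteration using only vertical multiplies and adds, with no horizontal operation. The function name `dot3_soa` and the layout (one separate float array per component) are assumptions for illustration, not something from the question. It also assumes `n` is a multiple of 4, and it uses the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` so the arrays need no special alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical SoA layout: ax/ay/az hold the components of n vectors a[i],
   bx/by/bz those of n vectors b[i].
   Computes out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i]. */
void dot3_soa(float *out,
              const float *ax, const float *ay, const float *az,
              const float *bx, const float *by, const float *bz,
              size_t n) {
    assert(n % 4 == 0); /* assumption: no scalar tail loop in this sketch */

    for (size_t i = 0; i < n; i += 4) {
        /* Each register holds the same component of four different vectors. */
        __m128 xx = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        __m128 yy = _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i));
        __m128 zz = _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i));

        /* Vertical adds only: lane k ends up with the dot product of pair i+k. */
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_add_ps(xx, yy), zz));
    }
}
```

With the AoS `Vec4_t` from the question, a single dot product would instead need the four lanes of one register summed together, which is the horizontal step the comment warns about.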
see [this](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#) — Amiri, Feb 15 '17 at 14:49
