
I have been watching data-oriented design talks recently, but I never understood the reasoning behind their unanimously chosen memory layout.

Let's say we have a 3D animation to render, and in each frame we need to re-normalize our orientation vectors.

The "Scalar code"

They always show code that might look something like this:

let scene = [{"camera1", vec4{1, 1, 1, 1}}, ...]

for object in scene
    object.orientation = normalize(object.orientation)

So far, so good. The memory at &scene might look roughly like this:

[string,X,Y,Z,W,string,X,Y,Z,W,string,X,Y,Z,W,...]
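
In C++ terms, a minimal sketch of that layout (type and field names are illustrative, not from any talk) might be:

#include <string>
#include <vector>

// Illustrative AoS scene: each object's name and orientation interleaved.
struct Object {
        std::string name;         // "camera1", ...
        float orientation[4];     // X, Y, Z, W
};

std::vector<Object> scene;        // [name,X,Y,Z,W, name,X,Y,Z,W, ...]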

"SSE aware code"

Every talk then shows the improved, cookie-cutter version:

let xs = [1, ...]
let ys = [1, ...]
let zs = [1, ...]
let ws = [1, ...]
let scene = [{"camera1", ptr_vec4{&xs[1], &ys[1], &zs[1], &ws[1]}}, ...]

for (o1, o2, o3, o4) in scene
    (o1, o2, o3, o4) = normalize_sse(o1, o2, o3, o4)

Which, due to its memory layout, is not only more memory-efficient, but can also process our scene 4 objects at a time.
Memory at &xs, &ys, &zs, and &ws:

[X,X,X,X,X,X,...]
[Y,Y,Y,Y,Y,Y,...]
[Z,Z,Z,Z,Z,Z,...]
[W,W,W,W,W,W,...]

But why 4 separate arrays?

If the __m128 (packed-4-singles) is the predominant type in engines,
    which I believe it is;
and if the type is 128 bits long,
    which it definitely is;
and if cache line width / 128 bits = 4,
    which it almost always is;
and if x86_64 is only capable of writing a full cache line,
    which I am almost certain of
- why is the data not structured as follows instead?!

Memory at &packed_orientations:

[X,X,X,X,Y,Y,Y,Y,Z,Z,Z,Z,W,W,W,W,X,X,...]
 ^---------cache-line------------^

I have no benchmark to test this on, and I don't understand intrinsics well enough to even try, but by my intuition, shouldn't this be way faster? We would be saving 4x the page loads and writes, simplifying allocations, and saving pointers, and the code would be simpler, since instead of 4 pointers we could use plain pointer addition. Am I wrong?
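
To illustrate the addressing I have in mind, a rough sketch (the exact intrinsics are a guess on my part, and I am assuming 16-byte-aligned data):

#include <cstddef>
#include <xmmintrin.h>

// Sketch: each group of 4 objects occupies 16 consecutive floats
// ([X,X,X,X][Y,Y,Y,Y][Z,Z,Z,Z][W,W,W,W]) behind a single base pointer.
void process_groups(float *packed_orientations, size_t num_objects)
{
        for(size_t g = 0lu; g < num_objects / 4lu; ++g) {
                float *base = packed_orientations + g * 16lu;
                __m128 xs = _mm_load_ps(base + 0);
                __m128 ys = _mm_load_ps(base + 4);
                __m128 zs = _mm_load_ps(base + 8);
                __m128 ws = _mm_load_ps(base + 12);
                // ... normalize as with 4 separate arrays, then
                // store back to the same offsets.
                _mm_store_ps(base + 0, xs);
                _mm_store_ps(base + 4, ys);
                _mm_store_ps(base + 8, zs);
                _mm_store_ps(base + 12, ws);
        }
}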

Thank you! :)

  • It depends on what kind of operations you are going to be performing as to whether AoS or SoA is more efficient. Relevant: [1](https://stackoverflow.com/q/5323154/253056), [2](https://stackoverflow.com/q/17924705/253056), [*et al*](https://stackoverflow.com/search?q=aos+soa). – Paul R Feb 04 '19 at 13:32
  • @PaulR Well, data-oriented design is all about corralling data into the pattern of sequential access of structures. In fact, if you wanted to access each component separately, that should then become a structure. This granularity is a sort of invariant to my question. But my idea is quite different. –  Feb 04 '19 at 13:42
  • For example, because you may want to perform different operations on different components. Consider RGBA images: alpha channel is almost always treated differently than r/g/b channels. The memory layout you're proposing would require additional bit-shifting and bit-extracting. This is harder to code and slower to execute than in classical per-component layout. – gudok Feb 04 '19 at 14:01
  • @gudok well in that case, if we were to strictly adhere to the DOD pattern, we would have to separate RGB from the A channel. Even if we didn't, we would not need additional work - the access pattern would still work, but you would definitely waste 75% of your cache reads and writes. Interesting... –  Feb 04 '19 at 14:09
  • @arctiq: gudok has already touched on this, but there are image processing algorithms that can be applied to a pixel at a time, and then there are others where it's more efficient to process one plane at a time. If you need to do both kinds then you have a dilemma as to whether to use AoS or SoA. If you only want to use one kind of access pattern though then the choice is relatively simple. – Paul R Feb 04 '19 at 15:59
  • This layout is sometimes called AoSoA because it's like an array of small SoA pieces; there are some advantages, but not 4x. – harold Feb 04 '19 at 16:27
  • What does IA64 (Itanium) have to do with anything here? SSE is x86 / x86-64 SIMD. – Peter Cordes Feb 04 '19 at 23:55

4 Answers


The amount of data you need to get through your memory subsystem is identical no matter whether you do 4 separate arrays or your suggested interleaving. You therefore don't save page loads or writes (I don't see why the "separate arrays" case should read and write each page or cache line more than once).

You do disperse the memory transfers more - you might have 1 L1 cache miss every iteration in your case and 4 cache misses every 4th iteration in the "separate arrays" case. I don't know which one would be preferred.

Anyway, the main point is to not have unnecessary memory pushed through your caches that you don't interact with. In your example, having string values that are neither read nor written but still pushed through the caches needlessly takes up bandwidth.

Max Langhof
  • Thanks, the strings in the scalar code were just a setup for the question. I definitely agree, and understand that removing unused data is essential for fast code. The rest I'll have to think about. –  Feb 04 '19 at 14:03
  • Oh, of course, I get it! All the randomly scattered arrays will be loaded into L1, and then the pre-fetcher just has to keep track of 4 pointers and try to keep up. Wow, can it really do that?! –  Feb 04 '19 at 14:13
  • @arctiq: Depends very much on the CPU. High-end server CPUs are certainly capable; a cheap CPU for embedded systems might not even have a prefetcher. – MSalters Feb 04 '19 at 14:27
  • @arctiq This might just be `clang` being stupid (why does it clobber and recalculate `rax` every time? I guess there's a reason that I don't see), but I just cannot get your interleaved version to use fewer instructions than the "separate arrays" version: https://godbolt.org/z/WehVhR – Max Langhof Feb 04 '19 at 14:45
  • Your godbolt example uses non-inline function calls, so looking at the code-gen seems pretty useless. Maybe you were looking at something else, because the code you linked doesn't do anything with RAX in either function. – Peter Cordes Feb 05 '19 at 07:30
  • @PeterCordes How would you write the code to generate "relevant" assembly for comparison? I agree that the code is probably not exactly representative of a real use case, but I couldn't massage it to what my intuition would tell me (namely that incrementing four separate pointers takes more instructions or registers than incrementing one pointer at four constant offsets). – Max Langhof Feb 05 '19 at 08:24
  • @PeterCordes Like, at [this](https://godbolt.org/z/ivUgaf) point I just gave up because `(&tomodify - offset)` should be trivial to calculate for the consecutive `rgba4` members (once you have the first difference, the others should just be "add 128 bits" every time), yet it uses more `add`s. Is there some aliasing I'm missing here? – Max Langhof Feb 05 '19 at 08:32
  • @MaxLanghof: It's probably weird because of `volatile` inside the loop. Oh wait, `volatile mm128* output;` is a pointer *to* volatile, but the pointer itself is not volatile, so it is legal for gcc/clang to optimize the loop that modifies it 4 times down to one store. That seems to just be clang being weird; gcc's asm is simpler (but still not just one add). https://godbolt.org/z/9_XluC. IDK, I don't know why you're looking at loops that only mess around with pointers instead of *actually* reading memory and e.g. summing `x` values, or sum-of-squares vector length. – Peter Cordes Feb 05 '19 at 08:54
  • Your "alternative 2" compiles very nicely with gcc https://godbolt.org/z/62L2Yn. 4x movups loads with increasing offsets in the addressing mode, doubling each of them with `addps same,same`, then 4x `movups` stores with the same addressing mode. vs. the separate arrays version incrementing 4 pointers. (And with overlap checks with 2 versions of the loop.) – Peter Cordes Feb 05 '19 at 08:57
  • @PeterCordes I was only messing with the pointers and not the memory because the discussion was about whether the computation of the addresses was better one way or another (i.e. accessing the memory afterwards is the same either way), and the "4 separate pointers" case should've _certainly_ lost that one (even if I messed up the volatile). If I can't get _that_ to behave as expected then clearly it's over my head. And yeah, that "alternative 2" gcc is what I was looking for. I guess the moral of the story is "check your resulting asm". – Max Langhof Feb 05 '19 at 12:03

One major downside to interleaving on the vector-width is that you need to change the layout to take advantage of wider vectors. (AVX, AVX512).
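
For instance, a minimal sketch (function name is mine, assuming the layout is regenerated with groups of 8 floats per component for AVX):

#include <immintrin.h>

// Sketch: at AVX width a group is [X*8][Y*8][Z*8][W*8] = 32 floats, so data
// interleaved at SSE width (16-float groups) no longer lines up.
void load_group_avx(const float *base, __m256 &xs, __m256 &ys,
                    __m256 &zs, __m256 &ws)
{
        xs = _mm256_loadu_ps(base + 0);
        ys = _mm256_loadu_ps(base + 8);
        zs = _mm256_loadu_ps(base + 16);
        ws = _mm256_loadu_ps(base + 24);
}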

But yes, when you're purely manually vectorizing (with no loops that a compiler might auto-vectorize with its choice of vector width), this could maybe be worth it if all your (important) loops always use all struct members.

Otherwise Max's point applies: a loop that touches only `x` and `y` will waste bandwidth on the `z` and `w` members.


It won't be way faster, though; with a reasonable amount of loop unrolling, indexing 4 arrays or incrementing 4 pointers is barely worse than 1. HW prefetch on Intel CPUs can track one forward + 1 backward stream per 4k page, so 4 input streams is basically fine.

(But L2 is 4-way associative in Skylake, down from 8-way in earlier generations, so more than 4 input streams all with the same alignment relative to a 4k page would cause conflict misses / defeat prefetching. So with more than 4 large / page-aligned arrays, an interleaved format could avoid that problem.)

For small arrays, where the whole interleaved thing fits in one 4k page, yes, that's a potential advantage. Otherwise it's about the same total number of pages touched and potential TLB misses, with misses coming one at a time 4x as often instead of in groups of 4. That might well be better for TLB prefetch, if it can do one page-walk ahead of time instead of being swamped by multiple TLB misses arriving at the same time.


Tweaking the SoA struct:

It may help to let the compiler know that the memory pointed to by each of the pointers is non-overlapping. Most C++ compilers (including all 4 major x86 compilers: gcc, clang, MSVC, and ICC) support `__restrict` as a keyword with the same semantics as C99 `restrict`. Or, for portability, use `#ifdef` / `#define` to define a `restrict` keyword as empty or `__restrict` or whatever is appropriate for the compiler.
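
A minimal sketch of such a shim (the macro name RESTRICT is my invention):

#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#define RESTRICT __restrict   // the major x86 compilers all spell it this way
#else
#define RESTRICT              // unknown compiler: expands to nothing
#endif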

struct SoA_scene {
        size_t size;
        float *__restrict xs;
        float *__restrict ys;
        float *__restrict zs;
        float *__restrict ws;
};

This can definitely help with auto-vectorization; otherwise the compiler doesn't know that `xs[i] = foo;` doesn't change the value of `ys[i+1]` for the next iteration.

If you read those vars into local variables (so the compiler is sure that pointer assignments don't modify the pointer itself in the struct), you might declare them as `float *__restrict xs = soa.xs;` and so on.
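
As a sketch (the function name normalize_all is hypothetical), combined with the struct above:

#include <cmath>
#include <cstddef>

// Sketch: local __restrict copies tell the compiler the four streams never
// overlap and that the pointers themselves can't change mid-loop, so the
// plain scalar loop becomes a good auto-vectorization candidate.
void normalize_all(SoA_scene &scene)
{
        float *__restrict xs = scene.xs;
        float *__restrict ys = scene.ys;
        float *__restrict zs = scene.zs;
        float *__restrict ws = scene.ws;

        for(size_t i = 0; i < scene.size; ++i) {
                float norm = std::sqrt(xs[i] * xs[i] + ys[i] * ys[i]
                                     + zs[i] * zs[i] + ws[i] * ws[i]);
                xs[i] /= norm;
                ys[i] /= norm;
                zs[i] /= norm;
                ws[i] /= norm;
        }
}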

The interleaved format inherently avoids this aliasing possibility.

Peter Cordes
  • Makes me wonder: If you use `std::vector` (no custom allocator) instead of `float*` (`__restrict`), does the compiler understand the impossibility of aliasing? – Max Langhof Feb 05 '19 at 08:39
  • @MaxLanghof: That might depend on the quality of your C++ library. Except no, `std::vector` can't use `__restrict` internally, because that would break code that points a `float*` into a `std::vector`'s storage. I don't think there's anything a `std::vector` library implementation can do on current compilers to tell them that it doesn't alias with the storage for other `std::vector`s of the same type. – Peter Cordes Feb 05 '19 at 08:43

One of the things not mentioned yet is that memory access has quite a bit of latency. And of course, when reading from 4 pointers, the result is available when the last value arrives. So even if 3 out of 4 values are in cache, the last value may need to come from memory and stall your whole operation.

That's why SSE doesn't even support this mode. All your values have to be contiguous in memory, and for quite a while they had to be aligned (so they couldn't cross a cache line boundary).

Importantly, that means your example (Structure of Arrays) does not work in SSE hardware. You can't use element [1] from 4 different vectors in a single operation. You can use elements [0] to [3] from a single vector.
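
For concreteness, a sketch: gathering one lane from each of four AoS vectors amounts to a 4x4 transpose, e.g. with the `_MM_TRANSPOSE4_PS` macro from `xmmintrin.h`, which costs around 8 shuffle/unpack instructions rather than a single load:

#include <xmmintrin.h>

// Sketch: four {x,y,z,w} vectors in, four per-component lane vectors out.
void aos_to_soa(__m128 v0, __m128 v1, __m128 v2, __m128 v3,
                __m128 &xs, __m128 &ys, __m128 &zs, __m128 &ws)
{
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);  // in-place 4x4 transpose
        xs = v0; ys = v1; zs = v2; ws = v3;
}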

MSalters
  • The point of the code is that you grab 4 `x` at once into an `mm128`, 4 `y` at once etc., then you do the same operation (that normally involves 1 `x`, 1 `y` etc.) four times at once. (At no point do you put `x[0]`, `y[0]` etc. into the _same_ `mm128`.) The question is about how to best arrange these groups of 4 in memory. – Max Langhof Feb 04 '19 at 14:41
  • Interesting point about latency. You misunderstand: each orientation field is in a different `__m128`, thus for addition: [x1,x2,x3,x4]+[y1,y2,y3,y4] instead of [x1,y1,z1,w1]+[x2,y2,z2,w2]. Sorry, the second example would not make sense for any step of normalization, but that is sort of a welcome consequence of the layout. –  Feb 04 '19 at 14:45
  • @arctiq: Ok, so you say that the call to `normalize_sse(o1, o2, o3, o4)` is actually normalizing 4 arrays, not a single 4-vector? Max has a point above - it's the implementation which dictates the best memory layout. I've got some particularly gnarly AVX code here where I have a partially-transposed matrix that's neither row-major nor column-major. It only makes sense once you have figured out how that matrix is used, when suddenly the accesses become sequential. – MSalters Feb 04 '19 at 14:50

I have implemented a simple benchmark for both methods.

Result: The striped layout is at best 10% faster than the standard layout*. But with SSE4.1 we can do much better.

*When compiled with gcc -Ofast on an i5-7200U CPU.

The striped structure is slightly easier to work with, but much less versatile. It might, however, have a bit of an advantage in a real scenario, once the allocator is sufficiently busy.

Striped layout

Time 4624 ms

Memory usage summary: heap total: 713728, heap peak: 713728, stack peak: 2896
         total calls   total memory   failed calls
 malloc|          3         713728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          1         640000
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <xmmintrin.h>

/* -----------------------------------------------------------------------------
        Striped layout [X,X,X,X,Y,Y,Y,Y,Z,Z,Z,Z,W,W,W,W,X,X,X,X,...]
----------------------------------------------------------------------------- */

using AoSoA_scene = std::vector<__m128>;

void print_scene(AoSoA_scene const &scene)
{
        // This is likely undefined behavior. Data might need to be stored
        // differently, but this is simpler to index.
        auto &&punned_data = reinterpret_cast<float const *>(scene.data());
        auto scene_size = std::size(scene);

        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene_size, 8lu); ++j) {
                for(size_t i = 0lu; i < 4lu; ++i) {
                        // Vector j lives in group j/4, lane j%4; component i
                        // of a group starts at float offset 4*i.
                        printf("%10.3e ", punned_data[(j / 4lu) * 16lu + 4lu * i + (j % 4lu)]);
                }
                printf("\n");
        }
        if(scene_size > 8lu) {
                printf("(%lu more)...\n", scene_size - 8lu);
        }
        printf("\n");
}

void normalize(AoSoA_scene &scene)
{
        // Euclidean norm, SIMD 4 x 4D-vectors at a time.
        for(size_t i = 0lu; i < scene.size(); i += 4lu) {
                __m128 xs = scene[i + 0lu];
                __m128 ys = scene[i + 1lu];
                __m128 zs = scene[i + 2lu];
                __m128 ws = scene[i + 3lu];

                __m128 xxs = _mm_mul_ps(xs, xs);
                __m128 yys = _mm_mul_ps(ys, ys);
                __m128 zzs = _mm_mul_ps(zs, zs);
                __m128 wws = _mm_mul_ps(ws, ws);

                __m128 xx_yys = _mm_add_ps(xxs, yys);
                __m128 zz_wws = _mm_add_ps(zzs, wws);

                __m128 xx_yy_zz_wws = _mm_add_ps(xx_yys, zz_wws);

                __m128 norms = _mm_sqrt_ps(xx_yy_zz_wws);

                scene[i + 0lu] = _mm_div_ps(xs, norms);
                scene[i + 1lu] = _mm_div_ps(ys, norms);
                scene[i + 2lu] = _mm_div_ps(zs, norms);
                scene[i + 3lu] = _mm_div_ps(ws, norms);
        }
}

float randf()
{
        std::random_device random_device;
        std::default_random_engine random_engine{random_device()};
        std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        // Has to be a multiple of 4!   -- No edge case handling.
        std::vector<__m128> scene(40'000);

        for(size_t i = 0lu; i < std::size(scene); ++i) {
                scene[i] = _mm_set_ps(randf(), randf(), randf(), randf());
        }

        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lu ms\n", duration.count());
        print_scene(scene);

        return 0;
}

SoA layout

Time 4982 ms

Memory usage summary: heap total: 713728, heap peak: 713728, stack peak: 2992
         total calls   total memory   failed calls
 malloc|          6         713728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          4         640000
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <xmmintrin.h>

/* -----------------------------------------------------------------------------
        SoA layout [X,X,X,X,...], [Y,Y,Y,Y,...], [Z,Z,Z,Z,...], ...
----------------------------------------------------------------------------- */

struct SoA_scene {
        size_t size;
        float *xs;
        float *ys;
        float *zs;
        float *ws;
};

void print_scene(SoA_scene const &scene)
{
        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene.size, 8lu); ++j) {
                printf("%10.3e ", scene.xs[j]);
                printf("%10.3e ", scene.ys[j]);
                printf("%10.3e ", scene.zs[j]);
                printf("%10.3e ", scene.ws[j]);
                printf("\n");
        }
        if(scene.size > 8lu) {
                printf("(%lu more)...\n", scene.size - 8lu);
        }
        printf("\n");
}

void normalize(SoA_scene &scene)
{
        // Euclidean norm, SIMD 4 x 4D-vectors at a time.
        for(size_t i = 0lu; i < scene.size; i += 4lu) {
                __m128 xs = _mm_load_ps(&scene.xs[i]);
                __m128 ys = _mm_load_ps(&scene.ys[i]);
                __m128 zs = _mm_load_ps(&scene.zs[i]);
                __m128 ws = _mm_load_ps(&scene.ws[i]);

                __m128 xxs = _mm_mul_ps(xs, xs);
                __m128 yys = _mm_mul_ps(ys, ys);
                __m128 zzs = _mm_mul_ps(zs, zs);
                __m128 wws = _mm_mul_ps(ws, ws);

                __m128 xx_yys = _mm_add_ps(xxs, yys);
                __m128 zz_wws = _mm_add_ps(zzs, wws);

                __m128 xx_yy_zz_wws = _mm_add_ps(xx_yys, zz_wws);

                __m128 norms = _mm_sqrt_ps(xx_yy_zz_wws);

                __m128 normed_xs = _mm_div_ps(xs, norms);
                __m128 normed_ys = _mm_div_ps(ys, norms);
                __m128 normed_zs = _mm_div_ps(zs, norms);
                __m128 normed_ws = _mm_div_ps(ws, norms);

                _mm_store_ps(&scene.xs[i], normed_xs);
                _mm_store_ps(&scene.ys[i], normed_ys);
                _mm_store_ps(&scene.zs[i], normed_zs);
                _mm_store_ps(&scene.ws[i], normed_ws);
        }
}

float randf()
{
        std::random_device random_device;
        std::default_random_engine random_engine{random_device()};
        std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        // Has to be a multiple of 4!   -- No edge case handling.
        auto scene_size = 40'000lu;
        std::vector<float> xs(scene_size);
        std::vector<float> ys(scene_size);
        std::vector<float> zs(scene_size);
        std::vector<float> ws(scene_size);

        for(size_t i = 0lu; i < scene_size; ++i) {
                xs[i] = randf();
                ys[i] = randf();
                zs[i] = randf();
                ws[i] = randf();
        }

        SoA_scene scene{
                scene_size,
                std::data(xs),
                std::data(ys),
                std::data(zs),
                std::data(ws)
        };
        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lu ms\n", duration.count());
        print_scene(scene);

        return 0;
}

AoS layout

Since SSE4.1 there seems to be a third option -- by far the simplest and fastest one.

Time 3074 ms

Memory usage summary: heap total: 746552, heap peak: 713736, stack peak: 2720
         total calls   total memory   failed calls
 malloc|          5         746552              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          2         672816
Histogram for block sizes:
    0-15              1  20% =========================
 1024-1039            1  20% =========================
32816-32831           1  20% =========================
   large              2  40% ==================================================

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

/* -----------------------------------------------------------------------------
        AoS layout [{X,Y,Z,W},{X,Y,Z,W},{X,Y,Z,W},{X,Y,Z,W},...]
----------------------------------------------------------------------------- */

using AoS_scene = std::vector<__m128>;

void print_scene(AoS_scene const &scene)
{
        // This is likely undefined behavior. Data might need to be stored
        // differently, but this is simpler to index.
        auto &&punned_data = reinterpret_cast<float const *>(scene.data());
        auto scene_size = std::size(scene);

        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene_size, 8lu); ++j) {
                for(size_t i = 0lu; i < 4lu; ++i) {
                        printf("%10.3e ", punned_data[j * 4lu + i]);
                }
                printf("\n");
        }
        if(scene_size > 8lu) {
                printf("(%lu more)...\n", scene_size - 8lu);
        }
        printf("\n");
}

void normalize(AoS_scene &scene)
{
        // Euclidean norm, one 4D vector per __m128.
        for(size_t i = 0lu; i < scene.size(); ++i) {
                __m128 vec = scene[i];
                __m128 dot = _mm_dp_ps(vec, vec, 255);
                __m128 norms = _mm_sqrt_ps(dot);
                scene[i] = _mm_div_ps(vec, norms);
        }
}

float randf()
{
        std::random_device random_device;
        std::default_random_engine random_engine{random_device()};
        std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        std::vector<__m128> scene(40'000);

        for(size_t i = 0lu; i < std::size(scene); ++i) {
                scene[i] = _mm_set_ps(randf(), randf(), randf(), randf());
        }

        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lu ms\n", duration.count());
        print_scene(scene);

        return 0;
}
  • One major downside to interleaving on the vector-width is that you need to change the layout to take advantage of wider vectors. (AVX, AVX512). But yes, when you're manually vectorizing, this can be worth it if all your (important) loops always use all struct members. Otherwise Max's point applies: a loop that touches only `x` and `y` will waste bandwidth on the `z` and `w` members. – Peter Cordes Feb 05 '19 at 00:00
  • `*(scene.xs + j)`. C++ has simpler syntax for that: `scene.xs[j]`. – Peter Cordes Feb 05 '19 at 02:58
  • I do often write stuff like `_mm_load_ps(scene.xs + i)`, rather than `&scene.xs[i]`, but both are valid. (Of course in code I'm writing myself, I tend to use pointer increments in the C++, because that's usually what I want in the asm. Intel CPUs can only keep indexed addressing modes micro-fused in limited cases, not including any VEX load+ALU instructions. [Micro fusion and addressing modes](//stackoverflow.com/q/26046634)) – Peter Cordes Feb 06 '19 at 08:42
  • @PeterCordes Since the variables are pointers, it seems inconsistent to me to pretend they are arrays in one place and then switch back to pointer arithmetic in another. Think of the newcomers, man! C++ is well confusing enough as it is. :) –  Feb 06 '19 at 11:27
  • I usually write mostly C-like code when optimizing. Like I said, I normally write code that does pointer increments, not pretending anything is an array. If you are trying to make your code look like arrays, then sure use `&arr[i]`, but that just looks noisy to me vs. `arr + i`. If the expression involves a memory access, then I have square brackets. Like `_mm_set1_ps(arr[i])`. If it doesn't (just calculating a pointer, even if I'm passing it to a load or store intrinsic), then no brackets. Same as asm with NASM syntax (err, if you ignore LEA :P). – Peter Cordes Feb 06 '19 at 11:33