
Here's the declaration of the infrastructure I have from an SDK:

struct alignas(32) Input {
    union {
        float values[16] = {};
        float value;
    };
    
    // other member variables
};

std::vector<Input> myInputs;

const int numInputsA = 4;
const int numInputsB = 4;
const int numInputsC = 4;
const int numInputsD = 4;
const int numInputsE = 4;
myInputs.resize(numInputsA + numInputsB + numInputsC + numInputsD + numInputsE);

What's the best way to load records faster with SIMD, such as:

__m128 targetA0 = { myInputs[0].values[0], myInputs[1].values[0], myInputs[2].values[0], myInputs[3].values[0] };
__m128 targetB0 = { myInputs[4 + 0].values[0], myInputs[4 + 1].values[0], myInputs[4 + 2].values[0], myInputs[4 + 3].values[0] };
__m128 targetC0 = { myInputs[8 + 0].values[0], myInputs[8 + 1].values[0], myInputs[8 + 2].values[0], myInputs[8 + 3].values[0] };
...
__m128 targetA1 = { myInputs[0].values[1], myInputs[1].values[1], myInputs[2].values[1], myInputs[3].values[1] };
__m128 targetB1 = { myInputs[4 + 0].values[1], myInputs[4 + 1].values[1], myInputs[4 + 2].values[1], myInputs[4 + 3].values[1] };
__m128 targetC1 = { myInputs[8 + 0].values[1], myInputs[8 + 1].values[1], myInputs[8 + 2].values[1], myInputs[8 + 3].values[1] };
...
... and so on

As you can see, the struct I've inherited isn't really laid out for accessing data this way, but I can't change it.

So the question, based on your experience: is it possible to load data into a register with an "offset" at each starting index? Or does the cache line need to load the whole block anyway, causing lots of cache misses?

Maybe there are some tricks to speed up the whole thing. As in my previous post, I'm still on a Windows/64-bit machine, using `FLAGS += -O3 -march=nocona -funsafe-math-optimizations` (imposed by the ecosystem I'm developing in).

Thanks for any help/tips/suggestions you can give me.

markzzz
  • I think I mentioned this on a previous question, but you probably want `-march=nocona -mtune=generic`, unless you actually care more about performance on P4 than on typical modern CPUs. It will still *run* on those old P4s, but tuning choices like when to inline and which instructions to use will be based on what's good for mainstream AMD and Intel CPUs. – Peter Cordes Sep 07 '21 at 20:53
  • x86 doesn't have strided loads, but it could be worthwhile to do vector loads and shuffle to transpose, if you can use pieces of 4x8 or 8x8 transposes, although with only 16 XMM regs that hold 4 floats each, you can't hold 12x 16 floats. – Peter Cordes Sep 07 '21 at 20:56 (a minimal sketch of this load-and-transpose approach appears after the comment thread)
  • @PeterCordes yes, I already tried your suggestion about `-mtune=generic` in the past, but I don't get any significant gain (less than 1%) – markzzz Sep 08 '21 at 06:42
  • @PeterCordes well, in fact I could "load" `values` for each Input (which is 16x float, thus 64 bytes, which can be loaded in a single shot) horizontally, and then transpose each corresponding index vertically. What's the best way to do this "transposition"? Any example? – markzzz Sep 08 '21 at 06:48
  • You have AVX-512 to load 64 bytes at once? Or you mean you can do 4 loads and load a whole struct? Assuming the latter, yeah, you can do that, but then what? How many of those values are going to be part of outputs you can generate without running out of registers? A0 and A1 each only use one of those values, so values from 4 total sets of 4x 4-float vectors. IDK if it would be worth storing/reloading any partially-combined data; maybe – Peter Cordes Sep 08 '21 at 07:03
  • As for SIMD transposes, google it, you'll find lots of examples of SSE / SSE2 / AVX transposes like 3x8, 4x4, 8x8. If you need the full set of outputs, you have a 20x16 transpose? A library like Eigen might help with optimized SIMD transposes. – Peter Cordes Sep 08 '21 at 07:09
  • Isn't the CPU cache line read 64 bytes every time? https://stackoverflow.com/questions/3928995/how-do-cache-lines-work This is what I mean: when I load myInputs[0].values, it will put the whole values[16] array in cache at "once", right? – markzzz Sep 08 '21 at 07:34
  • Yeah, cache works in 64-byte aligned chunks. Since your structs are only `alignas(32)`, they might be split across two cache lines, but you eventually touch all the data so it all becomes hot in cache. Doing more with the first cache line or two of data could help slightly if this data is cold when you get to it, giving HW prefetch time to pull in later lines before you run out of stuff to do with the first line. (Or more realistically, effectively hiding a few extra cycles of load latency, whichever level of cache the data is coming from.) – Peter Cordes Sep 08 '21 at 08:24
  • More relevant is how many total uops it's going to take to get the work done, though. That's why it matters to try to do something useful with all 4 floats in one XMM register, or at least two of them, to do something more useful than gathering data for each output one element at a time. – Peter Cordes Sep 08 '21 at 08:26
  • @PeterCordes so if I do 4 loads (each a 64-byte cache line), it's probable that most of the data is in some cache location before I transpose it, right? I mean: load values[] from myInputs[0], then myInputs[1], then myInputs[2], and finally myInputs[3] is (in the better case) 4 cache-line fetches. Once I've "moved" most of the data into cache (how large are caches generally?), the transpose should be fast. Right? – markzzz Sep 08 '21 at 19:01
  • You're entirely missing the point. Manual prefetch isn't relevant; the important thing is how many instructions it takes to do the actual work. If data is cold in cache to start with, memory pipelining and HW prefetch will bring it in as the loads try to execute. Tuning your code for throughput, minimizing the number of uops, will let OoO exec do a better job overlapping earlier and later code with this if any cache misses happen. – Peter Cordes Sep 08 '21 at 19:52
  • @PeterCordes I understand nothing from your last message :) I think I need to study more! – markzzz Sep 08 '21 at 20:00
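
To make the load-and-transpose idea from the comments concrete, here is a minimal sketch (an illustration added here, not code from the thread; `load_group_transposed` and `base` are made-up names). It uses only SSE intrinsics available under `-march=nocona`: four aligned 16-byte loads pull `values[0..3]` from four consecutive `Input` records, and `_MM_TRANSPOSE4_PS` rearranges them into the `targetA0`..`targetA3` layout from the question. Repeating with `values + 4`, `values + 8`, `values + 12`, and with `base` advanced by 4 records, covers the other value indices and the B/C/D/E groups.

#include <xmmintrin.h> // SSE: _mm_load_ps, _MM_TRANSPOSE4_PS

// Gather values[0..3] of four consecutive Input records into four registers,
// each holding { rec0.values[j], rec1.values[j], rec2.values[j], rec3.values[j] }.
// 'base' is assumed to point at the first record of a group, e.g. &myInputs[0].
inline void load_group_transposed(const Input* base,
                                  __m128& t0, __m128& t1,
                                  __m128& t2, __m128& t3)
{
    // One 16-byte aligned load per record: { values[0], values[1], values[2], values[3] }
    __m128 r0 = _mm_load_ps(base[0].values);
    __m128 r1 = _mm_load_ps(base[1].values);
    __m128 r2 = _mm_load_ps(base[2].values);
    __m128 r3 = _mm_load_ps(base[3].values);

    // 4x4 transpose in registers: afterwards r0 = { base[0..3].values[0] }, etc.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    t0 = r0; // == targetA0 when base == &myInputs[0]
    t1 = r1; // == targetA1
    t2 = r2; // == targetA2
    t3 = r3; // == targetA3
}

The aligned loads are safe because `values` sits at offset 0 of an `alignas(32)` struct, so every record's `values` is at least 16-byte aligned; with a less-aligned struct you would use `_mm_loadu_ps` instead.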

1 Answer


The only marginal improvement might be to change the `alignas` to 64, since you have 64 bytes of `values`, so that it hopefully lands within a single cache line.
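
For illustration, the change would only touch the alignment specifier. This is a hypothetical variant of the SDK struct (the question says the struct can't be changed, so it only applies if the SDK ever allows it); with C++17 aligned allocation, `std::vector` storage honours the 64-byte alignment:

// Hypothetical variant: alignas(64) pads sizeof(Input) to a multiple of 64,
// so each vector element starts on a cache-line boundary and the 64-byte
// values[16] array never straddles two lines.
struct alignas(64) Input {
    union {
        float values[16] = {};
        float value;
    };
    // other member variables
};
static_assert(sizeof(Input) % 64 == 0, "elements stay cache-line aligned");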

64 bytes happens to be the size of a cache line these days. So, assuming you need to fetch the data from RAM, your SIMD setup will hardly matter. The expensive part will be getting the data to L1 cache, the rest of the operations will be noise. Even if you need two cache lines because of alignment, I expect the increase to be very small. Keep in mind that today's processors don't execute things sequentially. Likely all those assignments run somewhat in parallel, so the actual order doesn't matter that much.

I would suggest writing a fairly simple version of your code (two loops) and looking at the assembly code generated. You're compiling with `-O3`, so even naive code will likely be optimized fairly well (if not better than hand-written intrinsics). If you are serious about optimizing this, you should set up a benchmark to verify that what you're doing is actually speeding things up. I would expect the simple version to be fast enough (please post if you're getting better results).
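
For example, a straightforward baseline might look something like this (a sketch; `gather_naive`, `out`, and `numGroups` are illustrative names, and the output layout is just one reasonable choice). Compile it with the same flags, look at the generated assembly, and benchmark it against the hand-written gathers before adding complexity:

#include <vector>
#include <xmmintrin.h> // SSE: _mm_setr_ps

// Naive baseline: one __m128 per (group, value index), gathered element by
// element. Groups A..E are 4 consecutive Input records each; the result is
// stored as out[group * 16 + idx], e.g. targetB1 ends up in out[1 * 16 + 1].
void gather_naive(const std::vector<Input>& in, __m128* out, int numGroups)
{
    for (int g = 0; g < numGroups; ++g) {
        const Input* rec = &in[g * 4];
        for (int idx = 0; idx < 16; ++idx) {
            out[g * 16 + idx] = _mm_setr_ps(rec[0].values[idx],
                                            rec[1].values[idx],
                                            rec[2].values[idx],
                                            rec[3].values[idx]);
        }
    }
}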

Also you should profile the entire application. It's likely you will find other bits of code that are easier to optimize and give you more benefits.

Can you get faster? Probably, but you start adding significant complexity and limitations to your code. I can imagine a situation where your code works fast on your workstation but just average on other CPUs. Also you will be complicating some non-trivial code. Is that worth it to you?

Sorin
  • *The expensive part will be getting the data to L1 cache, the rest of the operations will be noise.* - depends how recently the data was stored. Caches (often) work, that's why we have them. There might be considerable savings from SIMD shuffles for a 16 x N transpose, perhaps using smaller building blocks, compared to an L2 or even L3 hit, and those instructions will be stuck in the RS unexecuted until the load data arrives. So fewer uops means better OoO exec to hide cache-miss latency better; one of the major reasons OoO exec is valuable. – Peter Cordes Sep 07 '21 at 20:58
  • If your data is typically cold in DRAM you've already lost, so yes, fix that first by cache-blocking your algorithm, then other optimizations become even more relevant. – Peter Cordes Sep 07 '21 at 21:00
  • @PeterCordes Caches do work, but it's hard to guarantee that everything will be in cache. Depending on the setup even a few cache misses can mess up the speed, but it's hard to argue how much without a benchmark. The question clearly specifies that it's compiling with `-O3`. So we're comparing optimized code vs hand-optimized code, so I'm not convinced that the savings can be considerable. If we were talking SIMD vs `-O0`, then you are absolutely right. – Sorin Sep 10 '21 at 19:14
  • Manual intrinsics with `-O0` are usually terrible since you end up using more separate statements, so more store/reload. Also, intrinsics being defined as wrapper functions means extra store/reload just for arg-passing even when they inline (with `-O0`). Your argument would apply with `-O1` or `gcc -O2` (GCC only includes `-ftree-vectorize` at `-O3`, unlike clang). Anyway, yes, check compiler output to see if it auto-vectorized this. If not, then there might be significant gains. – Peter Cordes Sep 10 '21 at 22:01