Here's the declaration of the infrastructure I have from an SDK:
struct alignas(32) Input {
    union {
        float values[16] = {};
        float value;
    };
    // other member variables
};
std::vector<Input> myInputs;
const int numInputsA = 4;
const int numInputsB = 4;
const int numInputsC = 4;
const int numInputsD = 4;
const int numInputsE = 4;
myInputs.resize(numInputsA + numInputsB + numInputsC + numInputsD + numInputsE);
What's the best way to load these records into SIMD registers faster, in a pattern such as:
__m128 targetA0 = { myInputs[0].values[0], myInputs[1].values[0], myInputs[2].values[0], myInputs[3].values[0] };
__m128 targetB0 = { myInputs[4 + 0].values[0], myInputs[4 + 1].values[0], myInputs[4 + 2].values[0], myInputs[4 + 3].values[0] };
__m128 targetC0 = { myInputs[8 + 0].values[0], myInputs[8 + 1].values[0], myInputs[8 + 2].values[0], myInputs[8 + 3].values[0] };
...
__m128 targetA1 = { myInputs[0].values[1], myInputs[1].values[1], myInputs[2].values[1], myInputs[3].values[1] };
__m128 targetB1 = { myInputs[4 + 0].values[1], myInputs[4 + 1].values[1], myInputs[4 + 2].values[1], myInputs[4 + 3].values[1] };
__m128 targetC1 = { myInputs[8 + 0].values[1], myInputs[8 + 1].values[1], myInputs[8 + 2].values[1], myInputs[8 + 3].values[1] };
...
... and so on
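For reference, the pattern above is a 4x4 transpose: each target register takes one value from each of four consecutive inputs. Below is a minimal sketch of one way to express that with plain SSE (available under -march=nocona). It does four contiguous loads and then runs _MM_TRANSPOSE4_PS. The helper name loadTransposed4 is hypothetical, the struct is a stand-in for the SDK one without its other members, and the code assumes k is a multiple of 4 and the vector storage is at least 16-byte aligned, so that _mm_load_ps is valid.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _MM_TRANSPOSE4_PS

// Stand-in for the SDK struct; the real one has further members.
struct alignas(32) Input {
    union {
        float values[16] = {};
        float value;
    };
};

// Hypothetical helper (not part of the SDK).
// Loads values[k..k+3] from four consecutive inputs and transposes them,
// so that out[j] = { in[0].values[k+j], in[1].values[k+j],
//                    in[2].values[k+j], in[3].values[k+j] }.
// Assumption: k is a multiple of 4 and `in` is at least 16-byte aligned,
// otherwise _mm_load_ps must be replaced by _mm_loadu_ps.
inline void loadTransposed4(const Input* in, int k, __m128 out[4]) {
    __m128 r0 = _mm_load_ps(in[0].values + k);
    __m128 r1 = _mm_load_ps(in[1].values + k);
    __m128 r2 = _mm_load_ps(in[2].values + k);
    __m128 r3 = _mm_load_ps(in[3].values + k);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rows become columns, in place
    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
}
```

With this, loadTransposed4(&myInputs[0], 0, out) would yield targetA0..targetA3 in out[0]..out[3], and loadTransposed4(&myInputs[4], 0, out) the corresponding targetB0..targetB3.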
As you can see, the struct I've inherited is not really laid out for fetching data this way, but I can't change it.
So the question, drawing on your experience: is it possible to load data into a register with an "offset" for each starting index? Or does the whole cache line need to be loaded for each element anyway, involving lots of cache misses?
Maybe there are some tricks to speed up the whole thing.
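To make the cache-line concern concrete, here is a small sketch of the layout arithmetic. It assumes a 64-byte cache line, a base address aligned to 64 bytes, and a stand-in struct of exactly 64 bytes with values at offset 0. The real SDK struct has other members, so its stride is larger. The names kCacheLine, byteOffset and cacheLineOf are illustrative only.

```cpp
#include <cstddef>

// Stand-in for the SDK struct; the real one has further members,
// which makes sizeof(Input) larger than the 64 bytes seen here.
struct alignas(32) Input {
    union {
        float values[16] = {};
        float value;
    };
};

// Assumption: 64-byte cache lines, typical for x86-64.
constexpr std::size_t kCacheLine = 64;

static_assert(sizeof(Input) % 32 == 0, "size is a multiple of the alignment");
static_assert(sizeof(Input) >= kCacheLine, "values[] alone fills one line");

// Byte offset of myInputs[i].values[k] from the start of the array
// (values is the first member, so it sits at offset 0 in each element).
constexpr std::size_t byteOffset(std::size_t i, std::size_t k) {
    return i * sizeof(Input) + k * sizeof(float);
}

// Index of the cache line holding myInputs[i].values[k],
// assuming the array itself starts on a cache-line boundary.
constexpr std::size_t cacheLineOf(std::size_t i, std::size_t k) {
    return byteOffset(i, k) / kCacheLine;
}
```

Under these assumptions, values[0] of four consecutive inputs falls on four different lines, while values[0] through values[15] of a single input share one line. The lines fetched for targetA0 are therefore the same ones needed for targetA1, targetA2, and so on.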
As in my previous post, I'm still on a 64-bit Windows machine, using FLAGS += -O3 -march=nocona -funsafe-math-optimizations
(imposed by the ecosystem I'm developing in).
Thanks for any help, tips, or suggestions you can give me.