I've run into another problem with AVX code. I have 4 ymm registers whose elements need to be split vertically across 4 other ymm registers, so that each output register collects the same element position from every input (effectively a 4x4 transpose)
(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).
Here is an example:
// a, b, c, d are __m256d structs with [] operators to access xyzw
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
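
For reference, one common way to express this kind of 4x4 double transpose without going through scalar element access is an in-lane unpack followed by a 128-bit lane permute. The sketch below is only an illustration under some assumptions: the function name `transpose4x4_pd` and its pointer-based signature are hypothetical (not from my code above), the rows are assumed to sit contiguously in memory, and the `target("avx")` attribute is only there so it builds on GCC/Clang without `-mavx`.

```cpp
#include <immintrin.h>

// Hypothetical helper (name and signature are illustrative only).
// Reads four rows of four doubles from `in` (row-major: a, b, c, d)
// and writes the 4x4 transpose to `out` (rows A, B, C, D).
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("avx")))
#endif
void transpose4x4_pd(const double* in, double* out)
{
    __m256d a = _mm256_loadu_pd(in + 0);   // a0 a1 a2 a3
    __m256d b = _mm256_loadu_pd(in + 4);   // b0 b1 b2 b3
    __m256d c = _mm256_loadu_pd(in + 8);   // c0 c1 c2 c3
    __m256d d = _mm256_loadu_pd(in + 12);  // d0 d1 d2 d3

    // Interleave pairs of rows; unpack works within each 128-bit lane.
    __m256d t0 = _mm256_unpacklo_pd(a, b); // a0 b0 a2 b2
    __m256d t1 = _mm256_unpackhi_pd(a, b); // a1 b1 a3 b3
    __m256d t2 = _mm256_unpacklo_pd(c, d); // c0 d0 c2 d2
    __m256d t3 = _mm256_unpackhi_pd(c, d); // c1 d1 c3 d3

    // Combine 128-bit halves: 0x20 takes both low lanes, 0x31 both high lanes.
    __m256d A = _mm256_permute2f128_pd(t0, t2, 0x20); // a0 b0 c0 d0
    __m256d B = _mm256_permute2f128_pd(t1, t3, 0x20); // a1 b1 c1 d1
    __m256d C = _mm256_permute2f128_pd(t0, t2, 0x31); // a2 b2 c2 d2
    __m256d D = _mm256_permute2f128_pd(t1, t3, 0x31); // a3 b3 c3 d3

    _mm256_storeu_pd(out + 0,  A);
    _mm256_storeu_pd(out + 4,  B);
    _mm256_storeu_pd(out + 8,  C);
    _mm256_storeu_pd(out + 12, D);
}
```

This is 4 unpacks plus 4 lane permutes for the shuffle itself; I have not measured whether it beats the `_mm256_setr_pd` version on any particular CPU, and the loads and stores would drop out if the values already live in registers.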