
The code below looks like intrinsics to me; however, I am not familiar with intrinsic functions. Please help me convert it into plain code. testFunc() is especially unclear to me. I guess it also computes the dot product of two float vectors, but the labels Lrep and Lexit confuse me. Please explain it clearly for me. Also, are intrinsics available on mobile processors?

void testFunc(int M, int N, int K, float* A, float* B, float* C)
{
    float *a;
    float *b = new float[K*N];
    float *pointb = B;
    float *bb;
    float *answer = C;
    float c[8];

    for (int j = 0, k; j < K; j++) {
        bb = b + j;
        for (k = N / 8; k > 0; k--) {
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
            *bb = *pointb++; bb += K;
        }
        for (k = N / 8 * 8; k < N; k++) {
            *bb = *pointb++; bb += K;
        }
    }

    int K8 = K / 8 * 8;

    for (int i = 0; i < M; i++) for (int k = 0; k < N; k++) {
        a = A + i * K;
        bb = b + k * K;
        __asm {
            mov             esi, K8;
            sub             esi, 8;
            shl             esi, 2;
            xor             edi, edi;
            mov             edx, a;
            mov             ebx, bb;
            vxorps          ymm3, ymm3, ymm3;
        Lrep:
            cmp             edi, esi;
            jg              Lexit;
            vmovups         ymm0, ymmword ptr[edx + edi];
            vfmadd231ps     ymm3, ymm0, ymmword ptr[ebx + edi];
            add             edi, 32;
            jmp             Lrep;
        Lexit:
            vmovups         ymmword ptr[c], ymm3;
        }

        for (int j = K8; j < K; ) {
            *c += *(a + j) * *(bb + j); j++;
        }

        *answer = (c[0] + c[1] + c[2] + c[3] + c[4] + c[5] + c[6] + c[7]);
        answer++;
    }

    delete[] b;   // free the transposed copy of B
}

and

pA = A;
for (k = 0; k < K; k++) {
    pC = C;
    for (i = 0; i < M; i++) {
        pA = A + i * K + k;
        pB = B + k * N;
        for (j = N / 32; j > 0; j--) {
            _asm {
                mov             eax, pC;
                mov             ebx, pA;
                mov             ecx, pB;
                vmovups         ymm0, ymmword ptr[eax];
                vmovss          xmm1, dword ptr[ebx];
                vbroadcastss    ymm4, xmm1;
                vmovups         ymm2, ymmword ptr[ecx];
                vfmadd231ps     ymm0, ymm4, ymm2;
                vmovups         ymmword ptr[eax], ymm0;
            }
            pC += 8; pB += 8;
            _asm {
                mov             eax, pC;
                mov             ebx, pA;
                mov             ecx, pB;
                vmovups         ymm0, ymmword ptr[eax];
                vmovss          xmm1, dword ptr[ebx];
                vbroadcastss    ymm4, xmm1;
                vmovups         ymm2, ymmword ptr[ecx];
                vfmadd231ps     ymm0, ymm4, ymm2;
                vmovups         ymmword ptr[eax], ymm0;
            }
            pC += 8; pB += 8;
            _asm {
                mov             eax, pC;
                mov             ebx, pA;
                mov             ecx, pB;
                vmovups         ymm0, ymmword ptr[eax];
                vmovss          xmm1, dword ptr[ebx];
                vbroadcastss    ymm4, xmm1;
                vmovups         ymm2, ymmword ptr[ecx];
                vfmadd231ps     ymm0, ymm4, ymm2;
                vmovups         ymmword ptr[eax], ymm0;
            }
            pC += 8; pB += 8;
            _asm {
                mov             eax, pC;
                mov             ebx, pA;
                mov             ecx, pB;
                vmovups         ymm0, ymmword ptr[eax];
                vmovss          xmm1, dword ptr[ebx];
                vbroadcastss    ymm4, xmm1;
                vmovups         ymm2, ymmword ptr[ecx];
                vfmadd231ps     ymm0, ymm4, ymm2;
                vmovups         ymmword ptr[eax], ymm0;
            }
            pC += 8; pB += 8;
        }
        for (j = N / 32 * 32; j < N; j++) {
            *pC += *pA * *pB;
            pC += 1; pB += 1;
        }
    }
}
Seyon
  • You can directly copy the assembly code into a `__asm` block. However, your project architecture must be x86, since the x64 MSVC compiler does not support inline assembly. – seccpur Jul 22 '19 at 03:07
  • Hi, seccpur. Thank you for your answer. I've already copied the assembly code into my code. My trouble is not that the assembly code doesn't work, but that I cannot debug it correctly: the VS 2015 debugger skips these asm lines and therefore points at the wrong line. – Seyon Jul 22 '19 at 03:20

3 Answers


2 vector loads (from the same position in 2 arrays) feeding an FMA into a vector accumulator smells like a dot-product to me.

I didn't check the asm reference manual to confirm that the destination operand is the sum rather than one of the multiplicands, but that's the way that makes sense. (It is: vfmadd231ps dst, a, b computes dst = a*b + dst; the digits in 231 give the operand order of the multiply-add.)

The triple-nested loop looks like a matrix multiplication. It broadcasts one input while doing a vector load from the other to feed an FMA, so it's probably generating a SIMD vector of results for an output row: in scalar terms, C[i][j..j+7] += A[i][k] * B[k][j..j+7].

Using MSVC inline asm syntax for this is pretty bad; it can only accept inputs via memory operands so it forces a reload + store between each block of asm. If you're going to unroll, use one big asm statement and use displacements in the addressing modes.
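
For example, the four repeated _asm blocks in the question's matmul loop could be folded into one block using displacements, something like this (a rough, untested sketch that also broadcasts from pA only once):

_asm {
    mov             eax, pC;
    mov             ebx, pA;
    mov             ecx, pB;
    vbroadcastss    ymm4, dword ptr[ebx];               // broadcast the scalar once
    vmovups         ymm0, ymmword ptr[eax];
    vfmadd231ps     ymm0, ymm4, ymmword ptr[ecx];
    vmovups         ymmword ptr[eax], ymm0;
    vmovups         ymm1, ymmword ptr[eax + 32];        // next 8 floats via a displacement
    vfmadd231ps     ymm1, ymm4, ymmword ptr[ecx + 32];
    vmovups         ymmword ptr[eax + 32], ymm1;
    vmovups         ymm2, ymmword ptr[eax + 64];
    vfmadd231ps     ymm2, ymm4, ymmword ptr[ecx + 64];
    vmovups         ymmword ptr[eax + 64], ymm2;
    vmovups         ymm5, ymmword ptr[eax + 96];
    vfmadd231ps     ymm5, ymm4, ymmword ptr[ecx + 96];
    vmovups         ymmword ptr[eax + 96], ymm5;
}
pC += 32; pB += 32;   // 4 x 8 floats per trip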


IDK why the dot-product loop is written inefficiently (with both a conditional and unconditional branch inside the loop), and not unrolled with multiple accumulators. Pretty much defeats the purpose of hand-coding in asm. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for how to use multiple accumulators to hide FMA latency. Or let clang do it for you when unrolling+vectorizing a pure C loop.

I also don't know why it doesn't horizontal-sum the result, but instead just stores it to memory with vmovups [c], ymm3. Seems pointless. I guess the caller has to reload from memory and sum, or you could declare the function as returning a __m256 vector and ignore the store.
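
If you do want the reduction in SIMD, a horizontal sum of a __m256 looks something like this (a sketch from standard AVX/SSE3 intrinsics; hsum256_ps is just my name for the helper):

#include <immintrin.h>

static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    // low 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);  // high 4 floats
    lo = _mm_add_ps(lo, hi);                  // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(lo);        // duplicate odd elements
    __m128 sums = _mm_add_ps(lo, shuf);       // (0+1), _, (2+3), _
    shuf = _mm_movehl_ps(shuf, sums);         // bring (2+3) down to element 0
    sums = _mm_add_ss(sums, shuf);            // (0+1) + (2+3)
    return _mm_cvtss_f32(sums);
}

That would replace both the vmovups [c], ymm3 store and the c[0] + ... + c[7] sum at the end of testFunc.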


Anyway, you can obviously write a dot-product in scalar C code, perhaps using fma(a[i], b[i], sum) from math.h to replicate the asm's behaviour of not rounding the temporary result.
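
For example (a minimal sketch; dot_scalar is an illustrative name, and fmaf is the single-precision fma, so each step rounds once, like one lane of vfmadd231ps):

#include <math.h>

float dot_scalar(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum = fmaf(a[i], b[i], sum);  // fused multiply-add: a[i]*b[i] is not rounded
    return sum;
}

Note the asm version keeps 8 interleaved partial sums, so the low bits of its result can differ from this serial sum.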

Or copy the manual vectorization with intrinsics like sum = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]), _mm256_loadu_ps(&b[i]), sum); or something. (See Intel's intrinsics guide.)
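
Putting it together, a full dot product might look like this (a sketch; dot_avx is an illustrative name, and it uses two accumulators as suggested above plus the hsum256_ps helper from earlier):

#include <immintrin.h>

float dot_avx(const float *a, const float *b, int n)
{
    __m256 sum0 = _mm256_setzero_ps();
    __m256 sum1 = _mm256_setzero_ps();  // 2nd accumulator hides some FMA latency
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]),     _mm256_loadu_ps(&b[i]),     sum0);
        sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i + 8]), _mm256_loadu_ps(&b[i + 8]), sum1);
    }
    for (; i + 8 <= n; i += 8)          // one leftover full vector
        sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]), _mm256_loadu_ps(&b[i]), sum0);
    float total = hsum256_ps(_mm256_add_ps(sum0, sum1));
    for (; i < n; i++)                  // scalar tail
        total += a[i] * b[i];
    return total;
}

Unlike the __asm block, this also compiles in x64 builds, where MSVC inline assembly isn't available.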

Peter Cordes

In intrinsics, it's this code repeated 4 times.

{
// vmovups         ymm0, ymmword ptr[eax];
__m256 tempC = _mm256_loadu_ps(pC);

// vmovss          xmm1, dword ptr[ebx];
// vbroadcastss    ymm4, xmm1;
__m256 tempA = _mm256_set1_ps(*pA);

// vmovups         ymm2, ymmword ptr[ecx];
__m256 tempB = _mm256_loadu_ps(pB);

// vfmadd231ps     ymm0, ymm4, ymm2;
__m256 result = _mm256_fmadd_ps(tempA, tempB, tempC);

// vmovups         ymmword ptr[eax], ymm0;
_mm256_storeu_ps(pC, result);
}

pC += 8; pB += 8;

Constantly broadcasting the same value from pA seems a bit redundant though.
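
With intrinsics, the broadcast hoists naturally out of the j loop, e.g. (a sketch using the question's variable names; it does 8 floats per iteration instead of the manually unrolled 32, and assumes pC and pA don't alias):

__m256 tempA = _mm256_set1_ps(*pA);  // broadcast A[i*K + k] once per (k, i) pair
for (j = N / 8; j > 0; j--) {
    __m256 tempC = _mm256_loadu_ps(pC);
    __m256 tempB = _mm256_loadu_ps(pB);
    _mm256_storeu_ps(pC, _mm256_fmadd_ps(tempA, tempB, tempC));
    pC += 8; pB += 8;
}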

robthebloke
  • Thank you for your kind answer. Could you recommend materials to learn intrinsics? – Seyon Jul 22 '19 at 06:43
  • 1
    @Seyon Not really. It's the kind of thing where resources quickly become outdated. I just make use of the intel intrinsics guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=1711,1766,3413&text=_mm_&techs=AVX,AVX2 and godbolt (with the -mavx2 -mfma -O2 and -ffast-maths flags ). – robthebloke Jul 22 '19 at 08:23
  • @robthebloke and future readers: prefer `-march=haswell` or `-march=znver1` over `-mavx2 -mfma`. Especially with gcc, setting `-mtune=` to an Intel CPU with AVX2 + FMA rules out splitting `loadu` / `storeu` into multiple asm instructions. [Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?](//stackoverflow.com/q/52626726) If your data is actually aligned at runtime, it's a possibly-big downside. – Peter Cordes Jul 22 '19 at 16:43
  • One advantage of using intrinsics instead of asm is that the compiler can hoist the broadcast out of the loop. (At least if you use `float *restrict pC` or `pA` so the compiler knows that stores through `pC` won't affect values read from `pA`.) And yeah, I thought something seemed weird about that matmul asm when I was writing my answer, but I couldn't put my finger on it. Redundant loads explain why a quick skim didn't make it obvious whether it was looping down a column or across a row. Well spotted. – Peter Cordes Jul 23 '19 at 03:49

I'll do the first couple of lines to get you started, but really, if you can't read the assembly, you'll need to refer to the Intel CPU manuals to decipher it.

mov             esi, K8;
sub             esi, 8;
shl             esi, 2;
xor             edi, edi;
mov             edx, a;
mov             ebx, bb;
  1. copy the contents of K8 into esi
  2. subtract 8 from the value in esi
  3. shift esi left by 2 bits (i.e. multiply by 4) and store the result back in esi
  4. XOR edi against itself (this zeroes edi; the reason will be clear if you understand binary and how registers work)
  5. copy the contents of a into edx
  6. copy the contents of bb into ebx

From here, depending on where your knowledge is at, you'll need to familiarise yourself with binary, basic CPU architecture, and the assembly instructions relevant to your problem. Once you can read each line, you can decipher the blocks and finally the whole program.
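
Putting those lines together, the prologue and the Lrep/Lexit pair compute something like this (an illustration in C terms, not generated code):

int esi = (K8 - 8) * 4;  // byte offset of the last full 8-float chunk
int edi = 0;             // byte index into both arrays
// Lrep..Lexit is then just:
// while (edi <= esi) { ymm3 += 8 floats of a[] * 8 floats of bb[]; edi += 32; }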

Ace
  • Thank you for your kind answer. Then what about `vxorps ymm3, ymm3, ymm3;`? What should I learn to understand this? – Seyon Jul 22 '19 at 03:21
  • Even though I hadn't come across the vxorps instruction before, a quick Google search was all I needed to find its definition. I'm also new to SO, but I'm pretty sure the idea is not for me to just answer everything for you, but rather to push you in the right direction. Correct me if I'm wrong. – Ace Jul 22 '19 at 03:37