
I have written a short C "wrapper" function around a block of inline assembly, shown below. The assembly consists of a while loop that computes several vector dot products using SSE2. I am using GCC 4.8.4 on Ubuntu 14.04 on x86. The following code compiles to assembly "without problem" with

gcc -fpic -O2 -msse2 -S foo.c

But when I do

gcc -c foo.s

an error is triggered:

foo.c: Assembler messages:
foo.c:2: Error: unknown pseudo-op: `.while5'

I checked the assembler output "foo.s" and found something strange.

C file "foo.c":

#include <emmintrin.h>

void foo (int kk, double *A, double *B, double *ALPHA, double *C, int ldc) {
   asm("movl %0, %%ecx\n\t"  /* kk -> %ecx */
       "movl %3, %%eax\n\t"  /* A -> %eax */
       "movl %4, %%edx\n\t"  /* B -> %edx */
       /* a while-loop */
       ".while%=\n\t"
       "movsd   (%%edx), %%xmm5\n\t"
       "unpcklpd %%xmm5, %%xmm5\n\t"
       "movapd  %%xmm5, %%xmm6\n\t"
       "movapd  (%%eax), %%xmm4\n\t"
       "mulpd   %%xmm4, %%xmm6\n\t"
       "movapd  16(%%eax), %%xmm7\n\t"
       "addl    $32, %%eax\n\t"
       "addpd   %%xmm6, %%xmm0\n\t"
       "mulpd   %%xmm7, %%xmm5\n\t"
       "addpd   %%xmm5, %%xmm1\n\t"
       "movsd   8(%%edx), %%xmm6\n\t"
       "addl    $16, %%edx\n\t"
       "unpcklpd %%xmm6, %%xmm6\n\t"
       "mulpd   %%xmm6, %%xmm4\n\t"
       "addpd   %%xmm4, %%xmm2\n\t"
       "mulpd   %%xmm6, %%xmm7\n\t"
       "addpd   %%xmm7, %%xmm3\n\t"
       "subl    $1, %%ecx\n\t"  /* kk-- */
       "testl   %%ecx, %%ecx\n\t"  /* kk = 0 ? */
       "jne .while%=\n\t"
        /* other input operands passing */
       "movl %5, %%ecx\n\t"  /* C -> %ecx */
       "movl %1, %%eax\n\t"  /* ALPHA -> %eax, then C0 -> %eax */
       "movl %2, %%edx\n\t"  /* ldc -> %edx */
       /* write-back */
       "movsd (%%eax), %%xmm7\n\t"
       "unpcklpd %%xmm7, %%xmm7\n\t"
       "leal (%%ecx,%%edx,8), %%eax\n\t"  /* C0=C+ldc */
       "mulpd %%xmm7, %%xmm0\n\t"
       "addpd (%%ecx), %%xmm0\n\t"
       "movapd %%xmm0, (%%ecx)\n\t"
       "mulpd %%xmm7, %%xmm2\n\t"
       "addpd (%%eax), %%xmm2\n\t"
       "movapd %%xmm2, (%%eax)\n\t"
       "mulpd %%xmm7, %%xmm1\n\t"
       "addpd 16(%%ecx), %%xmm1\n\t"
       "movapd %%xmm1, 16(%%ecx)\n\t"
       "mulpd %%xmm7, %%xmm3\n\t"
       "addpd 16(%%eax), %%xmm3\n\t"
       "movapd %%xmm3, 16(%%eax)\n\t"
       : /* no output operands */
       : "m"(kk), "m"(ALPHA), "m"(ldc), "m"(A), "m"(B), "m"(C)  /* input operands */
       : "eax", "edx", "ecx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"  /* clobbers */ );
   }

Assembler output (the while-loop looks odd!):

.LFB503:
            .cfi_startproc
 #APP
 # 4 "foo.c" 1
            movl 4(%esp), %ecx
            movl 8(%esp), %eax
            movl 12(%esp), %edx
            .while5
            movsd   (%edx), %xmm5
            unpcklpd %xmm5, %xmm5
            movapd  %xmm5, %xmm6
            movapd  (%eax), %xmm4
            mulpd   %xmm4, %xmm6
            movapd  16(%eax), %xmm7
            addl    $32, %eax
            addpd   %xmm6, %xmm0
            mulpd   %xmm7, %xmm5
            addpd   %xmm5, %xmm1
            movsd   8(%edx), %xmm6
            addl    $16, %edx
            unpcklpd %xmm6, %xmm6
            mulpd   %xmm6, %xmm4
            addpd   %xmm4, %xmm2
            mulpd   %xmm6, %xmm7
            addpd   %xmm7, %xmm3
            subl    $1, %ecx
            testl   %ecx, %ecx
            jne .while5
            movl 20(%esp), %ecx
            movl 16(%esp), %eax
            movl 24(%esp), %edx
            movsd (%eax), %xmm7
            unpcklpd %xmm7, %xmm7
            leal (%ecx,%edx,8), %eax
            mulpd %xmm7, %xmm0
            addpd (%ecx), %xmm0
            movapd %xmm0, (%ecx)
            mulpd %xmm7, %xmm2
            addpd (%eax), %xmm2
            movapd %xmm2, (%eax)
            mulpd %xmm7, %xmm1
            addpd 16(%ecx), %xmm1
            movapd %xmm1, 16(%ecx)
            mulpd %xmm7, %xmm3
            addpd 16(%eax), %xmm3
            movapd %xmm3, 16(%eax)      
# 0 "" 2
#NO_APP
            ret
            .cfi_endproc

Can anyone kindly explain what has happened? I don't think this is a compiler problem; there must be something wrong with my code. Thanks!

Zheyuan Li
  • You can't make while loops in assembly. – fuz Feb 03 '16 at 23:23
  • You are missing a colon to make it a label: `.while%=:`. Note that `-S` does not assemble, it just produces assembly output so it can contain errors too. – Jester Feb 03 '16 at 23:23
  • That is not a `while` loop, but simple labels and jumps. – too honest for this site Feb 03 '16 at 23:33
  • @AlphaBetaGamma Ah! That's supposed to be a label. Typically, you don't indent labels and if you make a local label on x86, it should begin with `.L`, so `.Lwhile:`. I thought that was an attempt to use a nonexisting pseudo-opcode named `.while`. – fuz Feb 04 '16 at 00:12
  • Naming the label `while` is misleading. Yours is a `do {} while()`-loop, which makes a *significant difference* whenever your program is called with a count of zero. Also, your input operands all being `"m"` is *extremely* suspicious. – EOF Feb 04 '16 at 00:30
  • Instead of `movsd` / `unpcklpd`, use SSE3 `movddup` to do a broadcast-load. You would probably also get better results from splitting this into two `asm` statements. One statement holding the loop, where you ask for the input params you're going to modify using stuff like `[kk] "+r"(kk)`, and with `__m128d C0_tmp` output operands like `[C0] "=x" (C0_tmp)`. Then do the epilogue / store part with intrinsics, or with another asm statement that takes those same vectors as input operands, and has memory output operands. That lets the compiler handle integer reg allocation, and will work for 64b – Peter Cordes Apr 04 '16 at 01:02
  • Any time you're writing `mov` instructions in inline asm, you should look for a way to let the compiler do it, so it can avoid unnecessary instructions in some cases. e.g. for x86-64, the args will already all be in integer registers, so it's silly to do a bunch of `mov reg,reg` instead of just letting the compiler choose regs for you. (see [the x86 tag wiki](http://stackoverflow.com/tags/x86/info) for a link to my inline asm examples of how to let the compiler do as much as possible.) – Peter Cordes Apr 04 '16 at 01:20
  • @AlphaBetaGamma: "the CPU" only does what it's told, so that makes no sense. Your inline asm looks correct. Since you declared clobbers on all the non-operand regs you use by name, the compiler will put things somewhere else, like in a register not clobbered or used for an operand by your inline asm. In your case, before the inline asm, the compiler will store all 6 function args to memory, because your inline asm foolishly requires them to be memory operands, instead of using `"rmi"` constraints. (A `"+r"` constraint and then using the reg the compiler chose is best, of course). – Peter Cordes Apr 04 '16 at 15:05
  • Oh, you are missing a `"memory"` clobber, so the compiler isn't required to have done all the stores into the array that you loop over. When you access memory other than with a memory input or read-write operand, you need a `"memory"` clobber. The gcc docs have an example of using a struct as a dummy operand to tell the compiler which memory you modify, but a `"memory"` clobber (compiler memory barrier) is pretty cheap compared to a whole loop. – Peter Cordes Apr 04 '16 at 15:07
  • If you don't declare a `"memory"` clobber, at some point in the future your code is going to get inlined and optimized around, and something will break (e.g. a store to `C[i]` will be done after your inline asm, or a load from it will be done before). That's the thing with inline asm: if you get the constraints wrong, it can work perfectly in testing because gcc happens to do what the asm needs anyway. The memory clobber is a totally separate issue from not forcing the compiler to waste instructions getting the pointer operands into the asm. – Peter Cordes Apr 04 '16 at 15:17
  • @AlphaBetaGamma: post it on [codereview.SE] and tag me in a note here so I'll notice it. Err, if it's not working, then open a new question here on SO. This question needs to stay being about the `.while:` problem, to not invalidate the answer. – Peter Cordes Apr 04 '16 at 15:18
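
As a rough illustration of the constraint advice in the comments above (register operands chosen by the compiler, plus a `"memory"` clobber), here is a minimal sketch. It is not the code from the question: the function name `dot2`, the label `.Ldot`, and the operand names are made up for this example, and it assumes an x86 or x86-64 target with SSE2, `n` even and greater than zero.

```c
#include <assert.h>
#include <emmintrin.h>

/* Hypothetical helper: dot product of a[0..n-1] and b[0..n-1].
   Assumes n is even and n > 0 (the loop is do {} while style). */
double dot2(const double *a, const double *b, int n)
{
    __m128d acc = _mm_setzero_pd();   /* two partial sums */
    __m128d t1, t2;                   /* scratch vectors picked by the compiler */
    int pairs = n / 2;

    assert(n > 0 && n % 2 == 0);

    __asm__ (
        ".Ldot%=:\n\t"                   /* label: note the trailing ':' */
        "movupd (%[a]), %[t1]\n\t"       /* t1 = a[0], a[1] (unaligned load) */
        "movupd (%[b]), %[t2]\n\t"       /* t2 = b[0], b[1] */
        "mulpd  %[t2], %[t1]\n\t"        /* t1 *= t2 */
        "addpd  %[t1], %[acc]\n\t"       /* acc += t1 */
        "add    $16, %[a]\n\t"           /* advance both pointers by 2 doubles */
        "add    $16, %[b]\n\t"
        "subl   $1, %[pairs]\n\t"        /* sets ZF when the count reaches 0 */
        "jne    .Ldot%=\n\t"
        : [acc] "+x" (acc), [t1] "=&x" (t1), [t2] "=&x" (t2),
          [a] "+r" (a), [b] "+r" (b), [pairs] "+r" (pairs)
        : /* no input-only operands */
        : "cc", "memory");               /* asm reads memory through a and b */

    /* Horizontal sum of the two lanes, done with intrinsics. */
    return _mm_cvtsd_f64(acc) + _mm_cvtsd_f64(_mm_unpackhi_pd(acc, acc));
}
```

Because every register is an operand rather than a hard-coded name, the same source should build for both 32-bit (with `-msse2`) and 64-bit targets, and no `mov` instructions are needed to load the arguments.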

1 Answer


Because your .while%= line has no trailing colon, the assembler does not treat it as a label; it sees it as a [non-existent] pseudo-op.

Change:

".while%=\n\t"

Into:

".while%=:\n\t"
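
For illustration, here is a minimal sketch of a correctly labelled loop in GCC extended asm, assuming an x86 or x86-64 target and AT&T syntax. The function name `sum_down` and the label `.Lloop` are made up for this example. The `.L` prefix (suggested in the comments) keeps the label out of the symbol table, and `%=` expands to a number that is unique to each asm statement.

```c
#include <assert.h>

/* Hypothetical example: returns n + (n-1) + ... + 1 using a loop
   written in inline asm. Requires n > 0. */
unsigned sum_down(unsigned n)
{
    unsigned total = 0;

    /* The body runs before the test, as in do {} while, so n == 0
       would wrap around instead of skipping the loop. */
    assert(n > 0);

    __asm__ (
        ".Lloop%=:\n\t"               /* trailing ':' makes this a label */
        "addl  %[n], %[total]\n\t"    /* total += n */
        "subl  $1, %[n]\n\t"          /* n--, sets ZF when n reaches 0 */
        "jne   .Lloop%=\n\t"          /* loop while n != 0 */
        : [total] "+r" (total), [n] "+r" (n)
        : /* no input-only operands */
        : "cc");

    return total;
}
```

Calling `sum_down(4)` adds 4, 3, 2 and 1. Removing the colon after `.Lloop%=` should reproduce the "unknown pseudo-op" error from the question.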

UPDATE:

Per your request.

A "pseudo-op" is [terminology for] an assembler directive that doesn't correspond to an instruction.

Some examples:

.globl main to specify that the main symbol is global [i.e. visible to the linker].

.text to specify that what follows should be placed in the "text" segment (likewise for .data).

The . prefix is [generally] reserved for pseudo ops. That's why you got the Error: unknown pseudo-op: message.

If you had done "while%=\n\t" instead [still wrong because there was no : to denote a label], you would have gotten a different message: Error: no such instruction:
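
To make the distinction concrete, here is a small sketch that mixes pseudo-ops, a label, and instructions in one top-level asm block. It assumes x86 or x86-64 on Linux/ELF with GNU as; the symbol name `asm_answer` is hypothetical. It uses `.pushsection .text` / `.popsection` rather than a bare `.text`, on the assumption that restoring the previous section is the safer choice inside a C file.

```c
/* Top-level (basic) asm: the text goes to the assembler as-is,
   so registers take a single '%' here. */
__asm__(
    ".pushsection .text\n\t"           /* pseudo-op: switch to the text section */
    ".globl asm_answer\n\t"            /* pseudo-op: make the symbol global */
    ".type asm_answer, @function\n"    /* pseudo-op: mark it as a function (ELF) */
    "asm_answer:\n\t"                  /* label: note the trailing ':' */
    "movl $42, %eax\n\t"               /* instruction */
    "ret\n\t"                          /* instruction */
    ".popsection\n"                    /* pseudo-op: restore the previous section */
);

int asm_answer(void);  /* C prototype for the symbol defined above */
```

Every line starting with `.` is a directive the assembler knows about. The one line ending in `:` defines a label, and the remaining two are instructions.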

Craig Estey
  • You're welcome! Enjoy sse2/xmm, it's fun [_after_ it's been debugged :-)]. – Craig Estey Feb 04 '16 at 00:12
  • @CraigEstey Yeah! I recently made a function [79 times faster](https://www.reddit.com/r/asm/comments/43qma0) by implementing it in AVX assembly instead of C code. – fuz Feb 04 '16 at 01:07
  • @FUZxxl Wow! You're a man after my own heart! My cpu is too old for AVX but I've been itching to give it a try. I also commend you for having the sense to code this in asm [in a `.s`] instead of [like many do] try to stuff it into [`gcc`] inline asm. Finally, Intel is providing a rich enough instruction set to allow really good things (e.g. triple arg alone is worth the price of admission). – Craig Estey Feb 04 '16 at 01:19