I'm struggling with GCC inline assembly on PowerPC. The program compiles fine with -g2 -O3, but fails to compile with -g3 -O0. The problem is that I need to observe it under the debugger, so I need symbols without optimizations.

Here is the program:

$ cat test.cxx
#include <altivec.h>
#undef vector

typedef __vector unsigned char uint8x16_p;

uint8x16_p VectorFastLoad8(const void* p)
{
  long offset = 0;
  uint8x16_p res;
  __asm(" lxvd2x  %x0, %1, %2    \n\t"
        : "=wa" (res)
        : "g" (p), "g" (offset/4), "Z" (*(const char (*)[16]) p));
  return res;
}

And here's the error. (The error has existed since PowerPC vec_xl_be replacement using inline assembly, but I have been able to ignore it until now).

$ g++ -g3 -O0 -mcpu=power8 test.cxx -c
/home/test/tmp/ccWvBTN4.s: Assembler messages:
/home/test/tmp/ccWvBTN4.s:31: Error: operand out of range (64 is not between 0 and 31)
/home/test/tmp/ccWvBTN4.s:31: Error: syntax error; found `(', expected `,'
/home/test/tmp/ccWvBTN4.s:31: Error: junk at end of line: `(31),32(31)'

I believe this is the sore spot from the *.s listing:

#APP
 # 12 "test.cxx" 1
         lxvd2x  0, 64(31), 32(31)

There are some similar issues reported when using lwz, but I have not found one discussing problems with lxvd2x.

What is the problem and how do I fix it?


Here's the head of the *.s file:

$ head -n 40 test.s
        .file   "test.cxx"
        .abiversion 2
        .section        ".toc","aw"
        .align 3
        .section        ".text"
        .machine power8
.Ltext0:
        .align 2
        .globl _Z15VectorFastLoad8PKv
        .type   _Z15VectorFastLoad8PKv, @function
_Z15VectorFastLoad8PKv:
.LFB0:
        .file 1 "test.cxx"
        .loc 1 7 0
        .cfi_startproc
        std 31,-8(1)
        stdu 1,-96(1)
        .cfi_def_cfa_offset 96
        .cfi_offset 31, -8
        mr 31,1
        .cfi_def_cfa_register 31
        std 3,64(31)
.LBB2:
        .loc 1 8 0
        li 9,0
        std 9,32(31)
        .loc 1 12 0
        ld 9,64(31)
#APP
 # 12 "test.cxx" 1
         lxvd2x  0, 64(31), 32(31)

 # 0 "" 2
#NO_APP
        xxpermdi 0,0,0,2
        li 9,48
        stxvd2x 0,31,9
        .loc 1 13 0
        li 9,48
        lxvd2x 0,31,9

Here's the code generated at -O3:

$ g++ -g3 -O3 -mcpu=power8 test.cxx -save-temps -c
$ objdump --disassemble test.o | c++filt

test.o:     file format elf64-powerpcle

Disassembly of section .text:

0000000000000000 <VectorFastLoad8(void const*)>:
   0:   99 06 43 7c     lxvd2x  vs34,r3,r0
   4:   20 00 80 4e     blr
   8:   00 00 00 00     .long 0x0
   c:   00 09 00 00     .long 0x900
  10:   00 00 00 00     .long 0x0
jww
  • Forgive my ignorance... How is this question lacking a MCVE? I've got the problem isolated to a single function with 3 lines demonstrating the error. – jww Nov 01 '18 at 15:01
  • It is certainly not lacking MCVE. It is just some people have troubles understanding the question, so they decided to express this inability as a downvote/VTC. – SergeyA Nov 01 '18 at 15:03
  • Thanks @SergeyA. The more I think about it, it is probably one of those bot scripts using a puppet account running against my account. – jww Nov 01 '18 at 15:10
  • What asm gets generated with `-O3 -S` for the `lxvd2x` line ? – Paul R Nov 01 '18 at 15:25
  • @PaulR - `lxvd2x 34, 3, 0` is listed in the `*.s` file. `objdump` says `lxvd2x vs34,r3,r0` is the object file. Added to the question. – jww Nov 01 '18 at 15:38
  • You can use `-Og` to keep symbols for debugging while still having some optimizations turned on. – phuclv Nov 01 '18 at 16:17
  • Doesn't gcc have builtins for this? I see `LXVD2X` mentioned [here](https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html). – David Wohlferd Nov 01 '18 at 21:14
  • @David - Yes, but the problem is, the compilers load a vector which is endian-reversed on LE machines. Then, two more instructions are used after the load to reverse the vsx register. I have no idea why the compilers don't reverse the vector elements at compile time and just load the register with one instruction at runtime. – jww Nov 01 '18 at 21:21
  • @David - The net effect is, I've got BLAKE2B running at 3.54 cycles per byte (cpb) on big-endian systems; but it runs at 8.1 to 8.4 cpb on little-endian systems. BLAKE2B requires 65 loads, so I'm burning 3x65 insns for each execution of the compression function instead of 1x65. – jww Nov 01 '18 at 21:24
  • Sigh, It's always something. I assume you've already looked at the powerpc [constraints](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html)? It talks about using ‘m’ or ‘es’ instead of Z. Not sure how that could help, but it might be worth a shot. I assume you've already experimented with some of the other constraints instead of `g`? – David Wohlferd Nov 01 '18 at 21:32
  • @phuclv: `-Og` is separate from `-g`, isn't it? I thought `-Og` is just an optimization level that was supposed to be good for edit/compile/run cycles, but without the `-O0` behaviour of making every variable effectively `volatile` for fully consistent debugging, so things can stay in registers. So basically a variant of `-O1`. – Peter Cordes Nov 01 '18 at 22:35
  • @PeterCordes I thought that `-Og` prevents variables from being optimized out, as I've read somewhere. Looks like that's not the case: [Variables optimized out with g++ and the -Og option](https://stackoverflow.com/q/31435771/995714) – phuclv Nov 02 '18 at 02:03

1 Answer

The issue is that the generated asm has register+offset operands (such as 64(31)) for RA and RB, but the lxvd2x instruction only takes plain registers for those two operands (i.e., no offsets).

It looks like you've got your constraints wrong there. Looking at the inline asm:

__asm(" lxvd2x  %x0, %1, %2    \n\t"
    : "=wa" (res)
    : "g" (p), "g" (offset/4), "Z" (*(const char (*)[16]) p));

Firstly, you have one output operand and three input operands (so four in total), but only three operands used in your template.

I'm assuming that your function reads directly from *p, and it doesn't clobber anything, so it looks like this is an unused operand for indicating a potential memory access (more on that below). We'll keep it simple for now; dropping it gives us:

__asm(" lxvd2x  %x0, %1, %2    \n\t"
    : "=wa" (res)
    : "g" (p), "g" (offset/4));

Compiling that, I still get an offset used for the RA and/or RB:

 lxvd2x  0, 40(31), 9    

Looking at the docs for the "g" constraint, we see:

'g':

Any register, memory or immediate integer operand is allowed, except for registers that are not general registers.

However, we can't provide a memory operand here; only a register (without offset) is allowed. If we change the constraint to "r":

 __asm(" lxvd2x  %x0, %1, %2    \n\t"
       : "=wa" (res)
       : "r" (p), "r" (offset/4));

For me, this compiles to a valid lxvd2x invocation:

 lxvd2x  0, 9, 10

- which the assembler happily accepts.

Now, as @PeterCordes has commented, this example no longer indicates that it may access memory, so we should restore that memory input dependency, giving:

 __asm(" lxvd2x  %x0, %1, %2    \n\t"
    : "=wa" (res)
    : "r" (p), "r" (offset/4), "m" (*(const char (*)[16]) p));

In effect, all we've done is alter the constraints from "g" to "r", forcing the compiler to use non-offset register operands.
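Putting the pieces together, the whole function might look like the sketch below. It is not a tested drop-in. Two things in it are my own assumptions rather than part of the question: the base pointer uses the "b" constraint instead of "r" (on PowerPC an RA field of 0 in an indexed load means the literal value zero rather than r0, and "b" keeps the compiler from picking r0), and the `#else` branch with a stand-in struct and `memcpy` exists only so the sketch builds on non-POWER hosts. The offset is passed as-is, which is the same zero value as `offset/4` in the question.

```cpp
#include <cstring>

#if defined(__GNUC__) && defined(__VSX__)
# include <altivec.h>
# undef vector
typedef __vector unsigned char uint8x16_p;
#else
// Hypothetical stand-in type so the sketch compiles without VSX.
struct uint8x16_p { unsigned char b[16]; };
#endif

uint8x16_p VectorFastLoad8(const void* p)
{
    uint8x16_p res;
#if defined(__GNUC__) && defined(__VSX__)
    long offset = 0;
    // %1 (RA) and %2 (RB) must be plain registers, so no "g" here.
    // "b" is a base register (any GPR except r0); "r" is any GPR.
    // The trailing "m" operand is unused by the template; it only
    // tells the compiler that the 16 bytes at p are read.
    __asm("lxvd2x %x0, %1, %2"
          : "=wa" (res)
          : "b" (p), "r" (offset), "m" (*(const char (*)[16]) p));
#else
    // Fallback for other targets: a plain 16-byte copy.
    std::memcpy(&res, p, sizeof(res));
#endif
    return res;
}
```

Note that on little-endian POWER, lxvd2x loads the two doublewords in big-endian order, so the register contents are not a byte-for-byte copy of memory unless a permute follows.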

Jeremy Kerr
  • Beware that asking for pointers in registers does *not* imply that the pointed-to memory is also an input. Dead-store elimination and/or reordering of stores across the asm statement can break your code. So you either need a `"memory"` clobber (slow) or a dummy `"m"` operand that's unused by the asm template (except maybe in a comment so you can see what you got). This was a problem with the original code, too, I guess, but definitely worth mentioning here. – Peter Cordes Nov 02 '18 at 03:28
  • @PeterCordes yep, good point; that's likely what the initial (third) output operand was for. – Jeremy Kerr Nov 02 '18 at 03:56
  • Oh, yes, the `"Z" (*(const char (*)[16]) p))` *input* operand will do the trick nicely, assuming `"Z"` is something like `"m"`. I hadn't looked at the question, but it's fine. – Peter Cordes Nov 02 '18 at 04:00
  • @PeterCordes thanks for the feedback, I've edited the answer to suit. – Jeremy Kerr Nov 02 '18 at 04:10
  • Thank you very much. I found `"r"` worked too through fiddling, but I did not want to diverge from what I was told to use several months ago because I don't understand the Rube Goldberg machine very well. The tool is a complete mess to me. (I'd love to see them design a car, and see how screwed up they can make starting, steering and stopping. I would be amused for years). – jww Nov 02 '18 at 09:27
  • I also found `"m" (*(const char (*)[16])` makes a mess of the code. Performance drops by 3x. I no longer use it. The inline assembler is responsible for understanding what I write. If they can't get it right then they need to get the hell out of the way until they fix their broken tool. – jww Nov 02 '18 at 09:31
  • It may increase performance, but possibly at the cost of correctness; the compiler needs to know that the (opaque-to-the-compiler) inline asm accesses memory at [p,p+16], otherwise it is free to reorder operations around that asm. The assembler isn't aware of those dependencies, and can't do that itself - it's just turning the generated asm into binary instructions. – Jeremy Kerr Nov 03 '18 at 03:20
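The other option raised in the comments, a `"memory"` clobber in place of the dummy `"m"` operand, can be sketched as follows. This is only an illustration under assumptions: `FillAndLoad` is a hypothetical helper invented here, the "b" constraint for the base register is my choice (it excludes r0), and the `#else` branch is a stand-in so the sketch builds without VSX. The stores made by `memset` are exactly the kind of operation the compiler could drop or move if the asm did not declare that it reads memory.

```cpp
#include <cstring>

#if defined(__GNUC__) && defined(__VSX__)
# include <altivec.h>
# undef vector
typedef __vector unsigned char uint8x16_p;
#else
// Hypothetical stand-in type so the sketch compiles without VSX.
struct uint8x16_p { unsigned char b[16]; };
#endif

// Hypothetical helper: fills a local buffer, then loads it via the asm.
uint8x16_p FillAndLoad(unsigned char value)
{
    unsigned char buf[16];
    std::memset(buf, value, sizeof(buf));  // stores the load depends on
    const unsigned char* p = buf;
    uint8x16_p res;
#if defined(__GNUC__) && defined(__VSX__)
    long offset = 0;
    // No dummy "m" input here. The "memory" clobber instead tells the
    // compiler the asm may read or write any memory, so the stores to
    // buf above must be completed before it.
    __asm("lxvd2x %x0, %1, %2"
          : "=wa" (res)
          : "b" (p), "r" (offset)
          : "memory");
#else
    // Fallback for other targets: a plain 16-byte copy.
    std::memcpy(&res, p, sizeof(res));
#endif
    return res;
}
```

The clobber is the blunter tool: it orders the asm against every memory access, while the dummy `"m"` operand limits the dependency to the 16 bytes actually read.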