vpcmpeqb in inline assembly

Question

I am currently trying to move from NASM to inline assembly in C, since this would make linking a lot easier in the future (especially with inlining). However, I can't get my vector instructions to play nicely. In Intel syntax, I was able to do the following:

vpcmpeqb    ymm0, [rdi]

This reads 32 bytes from the address in rdi, compares them with ymm0, and marks the equal bytes. With AT&T syntax I tried the following in C inline asm, but it doesn't work: it keeps complaining about mismatched operand sizes (where %1 is the input, declared as "r"(s)):

vpcmpeqb    %%ymm0, %%ymm0, (%1)

I am compiling with GCC version 9.2.1.

AT&T syntax puts the destination last. NASM `vpcmpeqb [rdi], ymm0, ymm0` isn't legal for the same reason. You can use `objdump -d` to get AT&T disassembly of something you wrote in NASM, to help learn the syntax. — Peter Cordes, Apr 07 '20 at 14:02
**But seriously don't use inline asm in C for manual vectorization; use `__m256i v = _mm256_cmpeq_epi8` intrinsics.** https://gcc.gnu.org/wiki/DontUseInlineAsm You can typically get the compiler to generate asm about as good as what you could write by hand, and you don't have to worry about GNU C inline asm which is very hard to use. e.g. unless you had a `"memory"` clobber or dummy `"m"` input on that asm statement, asking for a pointer in a register is not safe. [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) — Peter Cordes, Apr 07 '20 at 14:02
@PeterCordes Thanks that did do the trick! I will also look out for inline assembly but intrinsics are such an ordeal IMO, they are even more tricky than inline assembly. — Harm Smits, Apr 07 '20 at 14:41
They have annoyingly long names to type, but the design model is mostly not terrible for most things. It's only tricky for some interaction with scalar. And it lets the compiler pick addressing modes to take advantage of its optimization capabilities, and makes your code somewhat more future-proof as well as portable across compilers. And most importantly, makes many subtle and hard-to-debug bugs impossible. Getting a constraint wrong can lead to code that works now, but some unrelated change could break it in the future. And even then asm always defeats constant-propagation optimizations. — Peter Cordes, Apr 07 '20 at 14:53
@PeterCordes Any good resources on intrinsics? I have absolutely no clue where to start. — Harm Smits, Apr 07 '20 at 15:03
@codam_hsmits This should be the first page to look at: https://software.intel.com/sites/landingpage/IntrinsicsGuide. You can search for asm mnemonics but also "browse" categories of different instructions. You can also look into the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", for which a nice (unofficial) online copy is available here: https://www.felixcloutier.com/x86 — chtz, Apr 07 '20 at 15:13
https://stackoverflow.com/tags/sse/info has a few links, including https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX which is an intro tutorial using intrinsics. At the time I added it to the tag wiki, I thought it was decent. Since you already know asm, just understand that intrinsics like `_mm_add_epi32` correspond to asm instructions like `paddd` in the same way that the `+` operator corresponds to `add` - compilers can optimize to do it differently. See also [this re: what `__m128i` really is](https://stackoverflow.com/questions/52112605) and strict-aliasing. — Peter Cordes, Apr 07 '20 at 15:27
Thanks! Already got a bit of stuff working previously but let's see how far I can get this time. — Harm Smits, Apr 07 '20 at 15:43

score 1 · Accepted Answer · answered Apr 07 '20 at 15:45

AT&T syntax uses a different operand order: the sources come first and the destination comes last, and the destination must be a register. To fix the issue, use:

vpcmpeqb    (%1), %%ymm0, %%ymm0
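For context, here is a minimal sketch of how the corrected instruction might sit inside a complete GNU C `asm` statement. It follows the suggestion from the comments and passes the data as an `"m"` memory operand instead of a bare pointer in a register, so the operand is written `%1` rather than `(%1)`. The function name `match_mask_asm`, the `needle` parameter, and the `target("avx2")` attribute are assumptions made for illustration and are not part of the original code.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: compares the 32 bytes at p against `needle` and
 * returns a bitmask with bit i set when p[i] == needle.
 * target("avx2") lets this compile without -mavx2; the caller is still
 * responsible for only calling it on a CPU that supports AVX2. */
__attribute__((target("avx2")))
static uint32_t match_mask_asm(const uint8_t *p, uint8_t needle)
{
    __m256i v = _mm256_set1_epi8((char)needle);   /* broadcast needle to all 32 bytes */

    /* AT&T order: memory source first, then register source, destination last.
     * "+x"(v): v is read and written, and lives in a vector register.
     * "m"(...): tells the compiler that exactly these 32 bytes are read,
     *           and lets it choose the addressing mode. */
    __asm__("vpcmpeqb %1, %0, %0"
            : "+x"(v)
            : "m"(*(const uint8_t (*)[32])p));

    /* Collect the top bit of each byte (0xFF for equal bytes) into an int. */
    return (uint32_t)_mm256_movemask_epi8(v);
}
```

With GCC, a call to this could be guarded by `__builtin_cpu_supports("avx2")` so that it only runs on hardware that has the instruction.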

Also, read the comment thread under the question for more information about this topic, in particular the advice to prefer intrinsics over inline asm.
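As a rough illustration of the intrinsics route recommended in the comments, the same comparison could be sketched as follows. This is only one possible way to write it: the function name `match_mask_intrin`, the `needle` parameter, and the `target("avx2")` attribute are assumptions for the example. The compiler may fold the load into the memory operand of `vpcmpeqb` or emit a separate load, at its discretion.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: returns a bitmask with bit i set when p[i] == needle,
 * for the 32 bytes starting at p. target("avx2") avoids needing -mavx2 on
 * the command line; the CPU must still support AVX2 at run time. */
__attribute__((target("avx2")))
static uint32_t match_mask_intrin(const uint8_t *p, uint8_t needle)
{
    __m256i v     = _mm256_set1_epi8((char)needle);          /* broadcast needle */
    __m256i chunk = _mm256_loadu_si256((const __m256i *)p);  /* unaligned 32-byte load */
    __m256i eq    = _mm256_cmpeq_epi8(chunk, v);             /* 0xFF where bytes match */
    return (uint32_t)_mm256_movemask_epi8(eq);               /* one bit per byte */
}
```

No constraints or clobbers are needed here, which removes the class of bugs the comments warn about.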

Harm Smits