Vectors need to be processed in registers, so memcpy can't possibly be useful here.
If auto-vectorization doesn't generate good code, the standard technique is to use vector intrinsics. If you can do what you need with ops that could compile to SIMD instructions on multiple architectures, then yeah, gcc vector syntax might be a good approach.
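For reference, here's a minimal sketch of the kind of gcc-vector code we're talking about (your actual v32ui_and.c isn't shown here, so the typedef and the way I write the 0x0f mask are my assumptions, not your code):

#include <stdint.h>

typedef uint8_t v32ui __attribute__((vector_size(32)));

void f(const uint8_t *begin, const uint8_t *end, uint8_t *out)
{
    for (; begin < end; begin += 32, out += 32) {
        // gcc broadcasts the scalar 0x0f to all 32 byte elements
        *(v32ui*)out = *(const v32ui*)begin & 0x0f;
    }
}

The dereferences assume 32-byte-aligned pointers, which matches the vmovdqa in the asm below.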
I tried out your first version with gcc 4.9.2. It generates exactly what you'd hope for with AVX: a 256-bit load, vector AND, and 256-bit store.
Without a -march or anything, just using baseline amd64 (SSE2), it copies the input to a buffer on the stack and loads from there. I think it's doing this in case of unaligned input/output buffers, instead of just using movdqu. Anyway, it's horribly slow code, and it would be way faster to do 8 bytes at a time in GP registers than this nonsense.
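To be clear about what I mean by that, a sketch of the 8-bytes-at-a-time GP-register version (f_scalar is just a name I made up; the memcpy here is only the portable idiom for an unaligned 8-byte load/store and compiles to plain 64-bit mov instructions at -O3, not a library call):

#include <stdint.h>
#include <string.h>

void f_scalar(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin < end; begin += 8, o += 8) {
        uint64_t chunk;
        memcpy(&chunk, begin, 8);            // unaligned 8-byte load
        chunk &= 0x0f0f0f0f0f0f0f0fULL;      // AND every byte with 0x0f
        memcpy(o, &chunk, 8);                // unaligned 8-byte store
    }
}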
gcc -march=native -O3 -S v32ui_and.c (on a Sandybridge, which has AVX but not AVX2):
.globl f
f:
cmpq %rsi, %rdi
jnb .L6
vmovdqa .LC0(%rip), %ymm1 # load a vector of 0x0f bytes
.p2align 4,,10
.p2align 3
.L3:
vandps (%rdi), %ymm1, %ymm0
addq $32, %rdi
vmovdqa %ymm0, (%rdx)
addq $32, %rdx
cmpq %rdi, %rsi
ja .L3
vzeroupper
.L6:
ret
Note the lack of scalar cleanup, or handling of unaligned data. vmovdqu is as fast as vmovdqa when the address is aligned, so it's a bit silly not to use it.
gcc -O3 -S v32ui_and.c (no -march, so baseline SSE2) is weird:
.globl f
f:
.LFB0:
cmpq %rsi, %rdi
movdqa .LC0(%rip), %xmm0 # load a vector of 0x0f bytes
jnb .L9
leaq 8(%rsp), %r10
andq $-32, %rsp # over-align the stack for the scratch buffer
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r10
.p2align 4,,10
.p2align 3
.L5:
movq (%rdi), %rax # copy the 32 input bytes to the stack, 8 at a time
addq $32, %rdi
addq $32, %rdx
movq %rax, -80(%rbp)
movq -24(%rdi), %rax
movq %rax, -72(%rbp)
movq -16(%rdi), %rax
movdqa -80(%rbp), %xmm1 # reload 16 bytes of it, aligned
movq %rax, -64(%rbp)
movq -8(%rdi), %rax
pand %xmm0, %xmm1
movq %rax, -56(%rbp)
movdqa -64(%rbp), %xmm2
pand %xmm0, %xmm2
movaps %xmm1, -112(%rbp) # store the ANDed results back to the stack
movq -112(%rbp), %rcx
movaps %xmm2, -96(%rbp)
movq -96(%rbp), %rax
movq %rcx, -32(%rdx) # then copy them to the output, 8 bytes at a time
movq -104(%rbp), %rcx
movq %rax, -16(%rdx)
movq -88(%rbp), %rax
movq %rcx, -24(%rdx)
movq %rax, -8(%rdx)
cmpq %rdi, %rsi
ja .L5
popq %r10
popq %rbp
leaq -8(%r10), %rsp
.L9:
rep ret
So I guess you can't safely use gcc vector extensions if it's sometimes going to generate code this bad. With intrinsics, the simplest implementation would be:
#include <immintrin.h>
#include <stdint.h>

void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    __m256i mask = _mm256_set1_epi8(0x0f);              // vector of 0x0f bytes
    for (; begin < end; begin += 32, o += 32) {
        __m256i s = _mm256_loadu_si256((const __m256i*)begin);
        __m256i d = _mm256_and_si256(s, mask);
        _mm256_storeu_si256((__m256i*)o, d);
    }
}
This generates identical code to the gcc-vector version (compiled with AVX2). Note this uses VPAND, not VANDPS, so it requires AVX2.
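If you only have AVX1 (like the Sandybridge above), you can still get the VANDPS version the gcc-vector code produced by going through the ps cast and AND intrinsics. A sketch (f_avx1 is a name I made up; the cast is free, it just reinterprets the register):

#include <immintrin.h>
#include <stdint.h>

void f_avx1(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi8(0x0f));
    for (; begin < end; begin += 32, o += 32) {
        __m256 s = _mm256_loadu_ps((const float*)begin);
        _mm256_storeu_ps((float*)o, _mm256_and_ps(s, mask));   // vandps
    }
}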
With large buffers, it would be worth doing a scalar startup until either the input or the output buffer is aligned to 16 or 32 bytes, then the vector loop, then any scalar cleanup needed (see the sketch below). With small buffers, just using unaligned loads/stores and a simple scalar cleanup at the end would be best.
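Roughly like this for the large-buffer case (a sketch only; f_cleanup is a made-up name, and I arbitrarily chose to align the output and leave the input loads unaligned):

#include <immintrin.h>
#include <stdint.h>

void f_cleanup(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    // scalar startup until the output pointer is 32-byte aligned
    while (begin < end && ((uintptr_t)o & 31)) {
        *o++ = *begin++ & 0x0f;
    }
    __m256i mask = _mm256_set1_epi8(0x0f);
    // main loop: unaligned loads, aligned stores
    for (; end - begin >= 32; begin += 32, o += 32) {
        __m256i s = _mm256_loadu_si256((const __m256i*)begin);
        _mm256_store_si256((__m256i*)o, _mm256_and_si256(s, mask));
    }
    // scalar cleanup for the last 0..31 bytes
    while (begin < end) {
        *o++ = *begin++ & 0x0f;
    }
}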
Since you asked about strings specifically: if your strings are nul-terminated (implicit-length), you have to be careful not to fault when the string ends before the end of a page but your vector read would span the page boundary.
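The usual way to guarantee that is to keep the vector loads aligned, since an aligned 16- or 32-byte load can never cross a page boundary; as long as the block contains at least one byte of the string, the whole block is in a mapped page. Roughly what SSE2 strlen-style loops do (a sketch, not your code; my_strlen is a made-up name):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t my_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    // Round down to 16-byte alignment: this load can't cross a page boundary.
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned m = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((const __m128i*)p), zero));
    m &= 0xFFFFu << (s - p);     // ignore any zero bytes before the start of the string
    while (m == 0) {
        p += 16;                 // still aligned, so still can't cross a page
        m = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128((const __m128i*)p), zero));
    }
    return (size_t)(p + __builtin_ctz(m) - s);
}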