
I am studying AVX by writing AVX code with inline assembly. In this case, I tried to apply AVX to a simple function. The baseline function I wrote is called lower_all_chars_base.

Its behavior is: OR every single char in the std::string with 0x20.

  • Let c be any char in the input.
  • Assume the input only contains chars satisfying 'A' <= c && c <= 'Z'.

So the function converts every character to lower case.
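As a quick sanity check (an editor's illustration, not part of the original post): in ASCII, setting bit 0x20 maps an uppercase letter to its lowercase counterpart.

#include <cassert>

int main() {
  /* 'A' is 0x41 and 'a' is 0x61, so c | 0x20 lower-cases any uppercase ASCII letter. */
  for (char c = 'A'; c <= 'Z'; c++)
    assert((c | 0x20) == c + ('a' - 'A'));
}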

I first wrote an AVX version of the function with unaligned store instructions, and there was no speedup at all.

Then I thought that if the memory accesses were aligned, it should be faster. So I wrote an AVX version with aligned stores, but gcc's base optimization at -O3 still beats my hand-vectorized code. What am I doing wrong here?

Functions Summary

  1. lower_all_chars_base: the simple scalar function.
  2. lower_all_chars_avx_aligned: AVX2 aligned-move version:
  • Process the unaligned head with the scalar operation.
  • Then process the aligned part with AVX2 and aligned moves.
  • Then process the rest with the scalar operation again.
  3. lower_all_chars_avx_unaligned: AVX2 unaligned-move version:
  • Process the data with AVX2 and unaligned moves.
  • Then process the rest with the scalar operation.

Questions

  1. Why does gcc's base optimization at -O3 beat my hand-written optimization?
  2. What am I doing wrong here?
  3. What is the proper AVX operation to do this?

Benchmark Result

  • CPU: Intel(R) Xeon(R) CPU E5-2650 v4 (2.20GHz)
  • Microarchitecture: Broadwell
root@esteh:/tmp# g++ --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@esteh:/tmp# g++ -Wall -Wextra -std=c++2a -O3 test.cpp -o test
root@esteh:/tmp# nice -n -20 ./test
lower_all_chars_base
Min   =  0.00662300
Max   =  0.00793100
Avg   =  0.00717280
Total =  0.07172800

lower_all_chars_avx_aligned
Min   =  0.00650200
Max   =  0.00785100
Avg   =  0.00726220
Total =  0.07262200

lower_all_chars_avx_unaligned
Min   =  0.00623600
Max   =  0.00835000
Avg   =  0.00701360
Total =  0.07013600

Code

Edit: N - 1 for the memset.

Godbolt link: https://godbolt.org/z/a16cGK


#include <ctime>
#include <cstdio>
#include <cassert>
#include <cstring>
#include <iostream>
using std::string;

void lower_all_chars_base(string &str);
void lower_all_chars_avx_aligned(string &str);
void lower_all_chars_avx_unaligned(string &str);
void do_benchmark(std::string &x, void (*fx)(string &));
void mem_flush(const void *p, unsigned int allocation_size);

#define N (size_t)(1024u * 1024 * 40)

#define BENCHMARK(STR, FX) do { \
 puts(#FX); \
 do_benchmark(STR, FX); \
} while(0)

int main() {
  static char x[N];
  memset(x, 'A', N - 1);
  string a(x), b(x), c(x);
  
  BENCHMARK(a, lower_all_chars_base);
  BENCHMARK(b, lower_all_chars_avx_aligned);
  BENCHMARK(c, lower_all_chars_avx_unaligned);

  assert(a == b);
  assert(b == c);

  memset(x, 'a', N - 1);
  assert(memcmp(c.c_str(), x, N - 1) == 0);
}

void do_benchmark(std::string &x, void (*fx)(string &)) {
  const size_t n = 10;
  double min, max, avg, c, total = 0;
  for (size_t i = 0; i < n; i++) {
    clock_t time0 = clock();
    fx(x);
    clock_t time1 = clock();

    c = (double)(time1 - time0) / CLOCKS_PER_SEC;
    total += c;
    if (i == 0) {
      min = max = c;
    } else {
      if (c > max) max = c;
      if (c < min) min = c;
    }
    mem_flush(x.c_str(), x.size());
  }
  avg = total / (double)n;
  printf("Min   =  %.8f\n", min);
  printf("Max   =  %.8f\n", max);
  printf("Avg   =  %.8f\n", avg);
  printf("Total =  %.8f\n\n", total);
}

__attribute__((noinline))
void lower_all_chars_base(string &str) {
  char *cs = (char *)str.c_str();
  size_t len = str.size();
  while (len--) {
    *cs++ |= 0x20;
  }
}

static const uint64_t mask[] __attribute__((aligned(32))) = {
  0x2020202020202020ull, 0x2020202020202020ull,
  0x2020202020202020ull, 0x2020202020202020ull
};

__attribute__((noinline))
void lower_all_chars_avx_aligned(string &str) {
  char *cs = (char *)str.c_str();
  size_t len = str.size();

  /* Only use AVX for data bigger than 4K. */
  if (len > 4096) {
    /* Handle unaligned data from the head. */
    uint8_t n = (uintptr_t)cs & 0b11111u;
    for (uint8_t i = 0; i < n; i++) {
      *cs++ |= 0x20;
    }

    len -= n;

    /* Prevent AVX from processing data beyond the array. */
    size_t vlen = len - 288;
    size_t j;

    /* Process the aligned memory with AVX. */
    asm volatile("vmovdqa %[mask], %%ymm0"::[mask]"m"(mask):"ymm0");
    for (j = 0; j < vlen; j += 288) {
      asm volatile(
        "vpor\t(%[cs],%[j]), %%ymm0, %%ymm1\n\t"
        "vpor\t32(%[cs],%[j]), %%ymm0, %%ymm2\n\t"
        "vpor\t64(%[cs],%[j]), %%ymm0, %%ymm3\n\t"
        "vpor\t96(%[cs],%[j]), %%ymm0, %%ymm4\n\t"
        "vpor\t128(%[cs],%[j]), %%ymm0, %%ymm5\n\t"
        "vpor\t160(%[cs],%[j]), %%ymm0, %%ymm6\n\t"
        "vpor\t192(%[cs],%[j]), %%ymm0, %%ymm7\n\t"
        "vpor\t224(%[cs],%[j]), %%ymm0, %%ymm8\n\t"
        "vpor\t256(%[cs],%[j]), %%ymm0, %%ymm9\n\t"
        "vmovdqa\t%%ymm1, (%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm2, 32(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm3, 64(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm4, 96(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm5, 128(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm6, 160(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm7, 192(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm8, 224(%[cs],%[j])\n\t"
        "vmovdqa\t%%ymm9, 256(%[cs],%[j])"
        :
        : [cs]"p"(cs), [j]"r"(j)
        : "memory", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5",
          "ymm6", "ymm7", "ymm8", "ymm9"
      );
    }
    asm volatile("vzeroupper":::
      "ymm0", "ymm1", "ymm2", "ymm3",
      "ymm4", "ymm5", "ymm6", "ymm7",
      "ymm8", "ymm9", "ymm10", "ymm11",
      "ymm12","ymm13","ymm14","ymm15"
    );
    cs  += j;
    len -= j;
  }

  /* Handle the elements remaining after the AVX loop. */
  for (size_t i = 0; i < len; i++) {
    *cs++ |= 0x20;
  }
}

__attribute__((noinline))
void lower_all_chars_avx_unaligned(string &str) {
  char *cs = (char *)str.c_str();
  size_t len = str.size();

  /* Only use AVX for data bigger than 4K. */
  if (len > 4096) {
    size_t j;
    size_t vlen  = len - 288;
    asm volatile("vmovdqa %[mask], %%ymm0"::[mask]"m"(mask):"ymm0");
    for (j = 0; j < vlen; j += 288) {
      asm volatile(
        "vpor\t(%[cs],%[j]), %%ymm0, %%ymm1\n\t"
        "vpor\t32(%[cs],%[j]), %%ymm0, %%ymm2\n\t"
        "vpor\t64(%[cs],%[j]), %%ymm0, %%ymm3\n\t"
        "vpor\t96(%[cs],%[j]), %%ymm0, %%ymm4\n\t"
        "vpor\t128(%[cs],%[j]), %%ymm0, %%ymm5\n\t"
        "vpor\t160(%[cs],%[j]), %%ymm0, %%ymm6\n\t"
        "vpor\t192(%[cs],%[j]), %%ymm0, %%ymm7\n\t"
        "vpor\t224(%[cs],%[j]), %%ymm0, %%ymm8\n\t"
        "vpor\t256(%[cs],%[j]), %%ymm0, %%ymm9\n\t"
        "vmovdqu\t%%ymm1, (%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm2, 32(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm3, 64(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm4, 96(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm5, 128(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm6, 160(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm7, 192(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm8, 224(%[cs],%[j])\n\t"
        "vmovdqu\t%%ymm9, 256(%[cs],%[j])"
        :
        : [cs]"p"(cs), [j]"r"(j)
        : "memory", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5",
          "ymm6", "ymm7", "ymm8", "ymm9"
      );
    }
    asm volatile("vzeroupper":::
      "ymm0", "ymm1", "ymm2", "ymm3",
      "ymm4", "ymm5", "ymm6", "ymm7",
      "ymm8", "ymm9", "ymm10", "ymm11",
      "ymm12","ymm13","ymm14","ymm15"
    );
    cs  += j;
    len -= j;
  }

  /* Handle the elements remaining after the AVX loop. */
  for (size_t i = 0; i < len; i++) {
    *cs++ |= 0x20;
  }
}

void mem_flush(const void *p, unsigned int allocation_size) {
  /* https://stackoverflow.com/a/43694725/7275114 */
  const size_t cache_line = 64;
  const char *cp = (const char *)p;
  size_t i = 0;
  if (p == NULL || allocation_size <= 0)
    return;

  for (i = 0; i < allocation_size; i += cache_line) {
    asm volatile("clflush (%0)"::"r"(&cp[i]):"memory");
  }
  asm volatile("sfence"::: "memory");
}

Ammar Faizi
  • At 40 MB, you presumably don't fit inside L2/L3 cache and so everything is going to and from main memory, which is surely the bottleneck regardless of how efficiently the CPU is doing the arithmetic. All three versions seem very similar in speed and that's what I'd expect. – Nate Eldredge Dec 22 '20 at 06:56
  • *Why does gcc base optimization -O3 beat up my optimization?* Have you looked at the assembly output on godbolt, and compared it with what you wrote? I'd have hoped that it's why you used godbolt :) *What is the proper AVX operation to do this?* Whatever the compiler did would be the baseline, although the compiler beats you without even using AVX – Kuba hasn't forgotten Monica Dec 22 '20 at 07:03
  • Note that ORing is idempotent: you can OR the same byte multiple times without changing it further. So you can handle the unaligned start/end with an unaligned load that can partially overlap the first aligned vector. Same for the tail. (So the only special case is input arrays shorter than 16 or 32 bytes.) – Peter Cordes Dec 22 '20 at 07:17 (see the sketch after these comments)
  • No need to use hand-written asm here (intrinsics are fine), but see [Looping over arrays with inline assembly](https://stackoverflow.com/q/34244185) - you can tell the compiler about memory you read and write so you don't need `volatile` and the `"memory"` clobber. Also, more importantly for performance, [Micro fusion and addressing modes](https://stackoverflow.com/a/31027695) - indexed addressing modes defeat micro-fusion for memory-source `vpor` even on Broadwell. Also, broadcast-load the mask with `vpbroadcastd` instead of a 32-byte constant. – Peter Cordes Dec 22 '20 at 07:20
  • Also, 4k is a pretty high threshold for using AVX. I guess it's not crazy since you have an auto-vectorized SSE2 loop that you also use for cleanup, with the compiler doing a bunch of work to figure out whether to use the vectorized loop or not. (If it knew for sure the cleanup loop would be at most 31 iterations, it might not vectorize.) Often leaving misalignment handling to the hardware is best, especially if you expect your input buffers to often be aligned. Using `vmovdqu` makes the aligned case optimally fast, with only a small extra price for misaligned, modern GCC does that. – Peter Cordes Dec 22 '20 at 07:24
  • I assume you are compiling your assembly code with optimisations enabled? Otherwise the code around it will be slower. Note that inline assembly prevents some optimisations, use intrinsics instead – Alan Birtles Dec 22 '20 at 07:26
  • @AlanBirtles: we don't have to assume, the OP did actually show the `-O3` option in their command and in their godbolt link. And even mentions it in the text. – Peter Cordes Dec 22 '20 at 07:28
  • RE: Peter's intrinsics comment, perhaps load a mask with `__m256i mask = _mm256_setr_epi64x(0x2020202020202020ull, 0x2020202020202020ull, 0x2020202020202020ull, 0x2020202020202020ull);`, then `__m256i a1 = _mm256_load_si256((__m256i*)(cs+j));` plus `__m256i b1 = _mm256_or_si256(mask, a1);` plus `_mm256_store_si256((__m256i*)(cs+j), b1);` (repeat 8 more times to match your existing code, adding in offsets). Maintainers of your code will much prefer supporting intrinsics to asm. And while it (obviously) won't be portable to other hw, it may support other compilers. GCC might even optimize it. – David Wohlferd Dec 22 '20 at 09:47
  • @DavidWohlferd: intrinsics: `__m256i mask = _mm256_set1_epi8( 0x20 );`. Unfortunately GCC will usually expand that to a 32-byte `vmovdqa` load, instead of a 4 or 8-byte broadcast load like clang usually manages. https://godbolt.org/z/846z83. So for GCC, the only benefit to using inline asm is that you can compress your vector constants. (Generating on the fly with `vpcmpeqd` / `vpabsb` / `vpslld` could work, too, but I think compilers are usually correct to choose a load.) – Peter Cordes Dec 22 '20 at 20:11
  • @PeterCordes _Often leaving misalignment handling to the hardware is best_. -- I don't really know how expensive accessing unaligned memory is. But it is clear that leaving it to the hardware will give extra price memory access. Let us say, I am considering the buffer is very large and the access to memory in each iteration is unaligned. The small extra price in each iteration might sum up to be expensive price, since the number of iterations will grow as the data grow bigger. Correct me if I am wrong. – Ammar Faizi Dec 26 '20 at 09:46
  • So I think, I need to compare which is more expensive between **handle unaligned head and tail** and **penalty of accessing unaligned memory** in the particular case. – Ammar Faizi Dec 26 '20 at 09:47
  • I will try to benchmark it myself once I get time. – Ammar Faizi Dec 26 '20 at 09:51
  • Yeah, if your buffers are large, or almost always misaligned, there can be something to gain from software alignment handling. However, if your arrays are so large that they miss in cache the cost of misalignment is often hidden by waiting for DRAM. (Replays of uops dependent on loads). Yeah it might be a couple % slower to be doing misaligned loads or stores in an AVX2 loop on current HW if you're missing in L3 cache. (Or 10 to 15% slower for an AVX512 loop where every single access is a cache-line split if misaligned). – Peter Cordes Dec 26 '20 at 10:45
  • You get the largest overall effect (in %) when your data is hot in L1d so could be going *much* faster. Your best bet is to make sure your data *is* aligned most of the time like I said, and let hardware handle the few cases where it isn't. (Although efficient handling of a maybe-unaligned first vector that maybe overlaps, and a maybe-unaligned last vector that again may overlap, can be pretty cheap.) If you can't ensure your data is usually aligned, then it's maybe worth handling if the average size is much larger than the startup overhead. Naive scalar startup could suck a lot, though. – Peter Cordes Dec 26 '20 at 10:49
  • Related: [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/q/45128763) covers the actual costs on Intel. – Peter Cordes Dec 26 '20 at 10:51
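Below is a minimal AVX2 intrinsics sketch (an editor's illustration, not code from the thread) of the overlapping-vector idea from Peter Cordes' comment: handle a possibly unaligned head and tail with unaligned vectors that may overlap the aligned middle part. It assumes the input is at least 32 bytes long; the function name is hypothetical.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

/* Sketch only: OR is idempotent, so the first and last (possibly unaligned)
   vectors are allowed to overlap the aligned middle part. Requires len >= 32. */
static void lower_all_chars_overlap(char *cs, size_t len) {
  const __m256i mask = _mm256_set1_epi8(0x20);

  /* Possibly unaligned first 32 bytes. */
  __m256i head = _mm256_loadu_si256((const __m256i *)cs);
  _mm256_storeu_si256((__m256i *)cs, _mm256_or_si256(head, mask));

  /* Aligned middle: start at the first 32-byte boundary after cs
     (the head store already covered everything before it). */
  char *p   = (char *)(((uintptr_t)cs + 32) & ~(uintptr_t)31);
  char *end = cs + len;
  for (; p + 32 <= end; p += 32) {
    __m256i v = _mm256_load_si256((const __m256i *)p);
    _mm256_store_si256((__m256i *)p, _mm256_or_si256(v, mask));
  }

  /* Possibly unaligned last 32 bytes, overlapping the middle part. */
  __m256i tail = _mm256_loadu_si256((const __m256i *)(end - 32));
  _mm256_storeu_si256((__m256i *)(end - 32), _mm256_or_si256(tail, mask));
}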

3 Answers


I tried to apply some of the suggestions from the comments.

I still do not use intrinsics. The hand-coded assembly version is now approximately 1.02 times faster than gcc's optimization with the -O3 -mavx2 flags. That is not a significant speedup, but I learned a lot about inline assembly. I am still waiting for other answers; I hope there is a better answer than this.


lower_all_chars_avx_aligned function changes summary:

  1. Use vbroadcastsd to load the mask, like clang does.
  2. Use a dummy memory output operand instead of volatile and the "memory" clobber in the inline assembly.
  3. Reduce the AVX2 threshold to 1024 bytes.
  4. Handle the unaligned head/tail with AVX2.
  5. The AVX2 loop is fully written in inline assembly.
  6. Utilize registers ymm0 through ymm15.
  7. Add __AVX__ and __AVX2__ macro checks to prevent vzeroupper from being emitted twice.

Benchmark changes summary:

  1. The data to be processed is 3.5 GB in length (it was only 40 MB).
  2. Run each function 30 times (it was only 10 times).
  3. Added -mavx2 to improve gcc base optimization.
  4. lower_all_chars_avx_unaligned is removed.
  5. Use heap memory from malloc instead of a static char[] variable to handle the bigger data.
  6. Run on Skylake microarchitecture (it was on Broadwell).

Benchmark info:

  • Min is the minimum time over the 30 function calls.
  • Max is the maximum time over the 30 function calls.
  • Avg is the average time over the 30 function calls.
  • Total is the total time of the 30 function calls.

Benchmark result:

  • CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  • Microarchitecture: Skylake
root@yukii-hpc2:/tmp# g++ -Wall -Wextra -std=c++2a -O3 -mavx2 test.cpp -o test
root@yukii-hpc2:/tmp# nice -n -20 ./test
lower_all_chars_avx_aligned
Min   =  0.31086600
Max   =  0.31319800
Avg   =  0.31159833
Total =  9.34795000

lower_all_chars_base
Min   =  0.31823400
Max   =  0.32902100
Avg   =  0.31904893
Total =  9.57146800

root@yukii-hpc2:/tmp# g++ --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@yukii-hpc2:/tmp# 

Code

/*
  https://stackoverflow.com/questions/65404362/avx2-code-cannot-be-faster-than-gcc-base-optmization
*/

#include <ctime>
#include <cstdio>
#include <cassert>
#include <cstring>
#include <cstdlib>
#include <iostream>
using std::string;

void lower_all_chars_base(string &str);
void lower_all_chars_avx_aligned(string &str);
void do_benchmark(string &x, void (*fx)(string &));
void mem_flush(const void *p, unsigned int allocation_size);

#define _M(N) (size_t)(1024ull * 1024 * (N))
#define _G(N) (size_t)(1024ull * 1024 * 1024 * (N))

#define N (_G(3) + _M(512) + 1234) /* 3.5 G + 1234 */
/*
   1234 is just to make the length odd, so the code will
   likely take the tail path of the aligned AVX loop.
*/

#define BENCHMARK(STR, FX) do { \
 puts(#FX); \
 do_benchmark(STR, FX); \
} while(0)

int main() {
  char *x = (char *)malloc(N + 1);
  memset(x, 'A', N);
  x[N] = '\0';

  {
    string a(x);
    memset(x, 'a', N);
    BENCHMARK(a, lower_all_chars_avx_aligned);
    assert(memcmp(a.c_str(), x, N) == 0);
  }

  /* Restore value for the next benchmark. */
  memset(x, 'A', N);

  {
    string a(x);
    memset(x, 'a', N);
    BENCHMARK(a, lower_all_chars_base);
    assert(memcmp(a.c_str(), x, N) == 0);
  }

  free(x);
}

inline static void lower_all_chars_b1024(char *cs, uint16_t len) {
  while (len--) {
    *cs++ |= 0x20;
  }
}

/* Aligned memory for mask for performance. */
static const uint64_t mask[] __attribute__((aligned(32))) = {
  0x2020202020202020ull
};

__attribute__((noinline))
void lower_all_chars_avx_aligned(string &str) {
  char   *cs = (char *)str.c_str();
  size_t len = str.size();

  /* Only use AVX for data bigger than or equal to 1K. */
  if (len >= 1024) {

    size_t    avx_size  = 0x1e0; /* Bytes per AVX main iteration. */
    char      *end      = &(cs[len]);

    /* End of aligned process iteration. */
    char      *end_avx  = &(end[-avx_size]);

    /* Dummy variable, to let the compiler choose the best GP register. */
    uintptr_t rem_bytes;

    asm(
      /* Prepare %[rem_bytes] initial value. */
      "movq\t%[end], %[rem_bytes]\n\t"

      /* Load the mask. */
      "vbroadcastsd\t%[mask], %%ymm0\n\t"

      /* Handle unaligned memory from the head. */
      "vpor\t(%[cs]), %%ymm0, %%ymm1\n\t"
      "vmovdqu\t%%ymm1, (%[cs])\n\t"
      "addq\t$0x20, %[cs]\n\t" /* Move to the next 32 bytes. */


      /* Handle aligned memory part.

        Use `vmovdqa` to make sure that the memory is
        aligned properly.

        Note that ORing is idempotent: you can OR the same
        byte multiple times without changing it further. So
        %[cs] can partially overlap with `vmovdqu` operation
        before this point.

        https://stackoverflow.com/questions/65404362/avx2-code-cannot-be-faster-than-gcc-base-optmization#comment115632279_65404362
      */
      "andq\t$~0b11111ull, %[cs]\n\t" /* Clear 5-bit LSB. */
      "1:\n\t"

      "vpor\t0x000(%[cs]), %%ymm0, %%ymm1\n\t"
      "vpor\t0x020(%[cs]), %%ymm0, %%ymm2\n\t"
      "vpor\t0x040(%[cs]), %%ymm0, %%ymm3\n\t"
      "vpor\t0x060(%[cs]), %%ymm0, %%ymm4\n\t"
      "vpor\t0x080(%[cs]), %%ymm0, %%ymm5\n\t"
      "vpor\t0x0a0(%[cs]), %%ymm0, %%ymm6\n\t"
      "vpor\t0x0c0(%[cs]), %%ymm0, %%ymm7\n\t"
      "vpor\t0x0e0(%[cs]), %%ymm0, %%ymm8\n\t"
      "vpor\t0x100(%[cs]), %%ymm0, %%ymm9\n\t"
      "vpor\t0x120(%[cs]), %%ymm0, %%ymm10\n\t"
      "vpor\t0x140(%[cs]), %%ymm0, %%ymm11\n\t"
      "vpor\t0x160(%[cs]), %%ymm0, %%ymm12\n\t"
      "vpor\t0x180(%[cs]), %%ymm0, %%ymm13\n\t"
      "vpor\t0x1a0(%[cs]), %%ymm0, %%ymm14\n\t"
      "vpor\t0x1c0(%[cs]), %%ymm0, %%ymm15\n\t"

      /* Plug the result to aligned memory.  */
      "vmovdqa\t%%ymm1, 0x000(%[cs])\n\t"
      "vmovdqa\t%%ymm2, 0x020(%[cs])\n\t"
      "vmovdqa\t%%ymm3, 0x040(%[cs])\n\t"
      "vmovdqa\t%%ymm4, 0x060(%[cs])\n\t"
      "vmovdqa\t%%ymm5, 0x080(%[cs])\n\t"
      "vmovdqa\t%%ymm6, 0x0a0(%[cs])\n\t"
      "vmovdqa\t%%ymm7, 0x0c0(%[cs])\n\t"
      "vmovdqa\t%%ymm8, 0x0e0(%[cs])\n\t"
      "vmovdqa\t%%ymm9, 0x100(%[cs])\n\t"
      "vmovdqa\t%%ymm10, 0x120(%[cs])\n\t"
      "vmovdqa\t%%ymm11, 0x140(%[cs])\n\t"
      "vmovdqa\t%%ymm12, 0x160(%[cs])\n\t"
      "vmovdqa\t%%ymm13, 0x180(%[cs])\n\t"
      "vmovdqa\t%%ymm14, 0x1a0(%[cs])\n\t"
      "vmovdqa\t%%ymm15, 0x1c0(%[cs])\n\t"

      "addq\t%[avx_size], %[cs]\n\t"
      "cmpq\t%[end_avx], %[cs]\n\t"
      "jb\t1b\n\t"

      "subq\t%[cs], %[rem_bytes]\n\t"

      /* Now, %[rem_bytes] contains the remaining bytes. */
      "testq\t%[rem_bytes], %[rem_bytes]\n\t"
      "jz\t3f\n\t"
      /* There's no remaining bytes if `jz` is taken. */


      /* Handle the tail; we may back off several bytes
         so that the remaining byte count is a multiple of 32.
       */
      "leaq\t0b11111(%[rem_bytes]), %[dec_avx]\n\t"
      "andq\t$~0b11111ull, %[dec_avx]\n\t"
      "subq\t%[rem_bytes], %[dec_avx]\n\t"
      "subq\t%[dec_avx], %[cs]\n\t"

      "2:\n\t"
      "vpor\t(%[cs]), %%ymm0, %%ymm1\n\t"
      "vmovdqu\t%%ymm1, (%[cs])\n\t"
      "addq\t$0x20, %[cs]\n\t"
      "cmpq\t%[end], %[cs]\n\t"
      "jb\t2b\n\t"

      "3:\n\t"
      #if !defined(__AVX__) && !defined(__AVX2__)
      "vzeroupper"
      #endif
      /* Output */
      : [cs]"+r"(cs),
        [end]"+r"(end),
        [end_avx]"+r"(end_avx),
        [dec_avx]"=r"(end_avx), /* May reuse end_avx if needed. */
        [rem_bytes]"=r"(rem_bytes),

        /* Tell the compiler that this inline assembly is
           going to read/write `len` bytes from `cs`. */
        [dummy_mem_output]"+m"(*(char (*)[len])cs)

      /* Input */
      : [mask]"m"(mask),
        [avx_size]"n"(avx_size)

      /* Clobbers */
      : "ymm0", "ymm1", "ymm2", "ymm3",
        "ymm4", "ymm5", "ymm6", "ymm7",
        "ymm8", "ymm9", "ymm10", "ymm11",
        "ymm12", "ymm13", "ymm14", "ymm15"
    );
  } else {
    /* Let the compiler use its own optimization here. */
    lower_all_chars_b1024(cs, len);
  }
}

__attribute__((noinline))
void lower_all_chars_base(string &str) {
  char *cs = (char *)str.c_str();
  size_t len = str.size();
  while (len--) {
    *cs++ |= 0x20;
  }
}

void do_benchmark(string &x, void (*fx)(string &)) {
  const size_t n = 30;
  double min = 0, max = 0, avg, c, total = 0;
  for (size_t i = 0; i < n; i++) {

    mem_flush(x.c_str(), x.size());

    clock_t time0 = clock();
    fx(x);
    clock_t time1 = clock();

    c = (double)(time1 - time0) / CLOCKS_PER_SEC;
    total += c;
    if (i == 0) {
      min = max = c;
    } else {
      if (c > max) max = c;
      if (c < min) min = c;
    }
  }
  avg = total / (double)n;
  printf("Min   =  %.8f\n", min);
  printf("Max   =  %.8f\n", max);
  printf("Avg   =  %.8f\n", avg);
  printf("Total =  %.8f\n\n", total);
}

void mem_flush(const void *p, unsigned int allocation_size) {
  /* https://stackoverflow.com/a/43694725/7275114 */
  const size_t cache_line = 64;
  const char *cp = (const char *)p;
  size_t i = 0;
  if (p == NULL || allocation_size <= 0)
    return;

  for (i = 0; i < allocation_size; i += cache_line) {
    asm volatile("clflush (%0)"::"r"(&cp[i]):"memory");
  }
  asm volatile("sfence"::: "memory");
}

Ammar Faizi
  • "Yet, I do not use intrinsics" - Might be worth taking a [look](https://godbolt.org/z/dGGn95). I don't know that the performance is any different (hard to say on godbolt), but looking at what the compiler does might give you some ideas. Ignore the other attempts in that link. `lower_all_chars_avx_aligned4` is what you want to look at. As for this code, it looks like you're processing 480 bytes per loop. So your "tail" could be 479 bytes. That's a lot of unaligned reads/writes. You might also want to look at interleaving your instructions (or/mov/or/mov). Intel's iaca might help with that. – David Wohlferd Dec 25 '20 at 22:51
  • @DavidWohlferd Thanks for the insight. I am exploring that. Also, yeah I think the tail is worth to be improved. I may edit the answer later if I achieve the better result. – Ammar Faizi Dec 26 '20 at 09:13

From personal experience: to improve on this you need to learn more about the overall architecture of the processor you are using, specifically the L1 cache line size and what triggers the load of L1 cache lines. Try writing and benchmarking a read-only loop (such as sum_of_bytes, or strlen) first instead of a read-modify-write loop, and optimize that. What you will find is that the code shown above stalls every time it crosses a cache line boundary, waiting for the data to transfer from the next cache level (L1, L2, L3) to where your code needs it to be. There can be similar stalls at 4 kB or 64 kB boundaries, depending on the virtual memory page size used by your operating system and processor.

Each of those potential stalls can be hidden from the runtime of your code if you provide processor "prefetch" hints saying, in effect, "at the end of the current inner loop, we will want the data at cursor + 1024" (or a suitable offset for caching or paging). Also, limit the inner loop to under 1024 micro-ops to allow full use of the CPU instruction decode pipeline.

Additionally, once a certain minimum input/output buffer size has been reached, it is actually worthwhile to multi-thread the code and process in parallel. There are trade-offs, such as the time needed to set up the loops for each thread and the NUMA affinity of a thread to its data buffers. Overall, this is not an easy problem, and it is usually not worth the effort unless you are highly optimizing one CPU model for some sort of embedded application or want to shine on an HPC benchmark.
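To make the prefetch-hint suggestion concrete, here is a minimal sketch (an editor's illustration, not the answer author's code) using the _mm_prefetch intrinsic; the 1024-byte look-ahead distance is an assumption that would need tuning per CPU.

#include <immintrin.h>
#include <cstddef>

/* Sketch: the same OR-with-0x20 loop, with a software prefetch hint a fixed
   distance ahead. PREFETCH_DIST is a guess and must be tuned for the target. */
static void lower_all_chars_prefetch(char *cs, size_t len) {
  const size_t PREFETCH_DIST = 1024;
  const __m256i mask = _mm256_set1_epi8(0x20);
  size_t i = 0;
  for (; i + 32 <= len; i += 32) {
    if (i + PREFETCH_DIST < len)
      _mm_prefetch(cs + i + PREFETCH_DIST, _MM_HINT_T0);
    __m256i v = _mm256_loadu_si256((const __m256i *)(cs + i));
    _mm256_storeu_si256((__m256i *)(cs + i), _mm256_or_si256(v, mask));
  }
  for (; i < len; i++)  /* scalar tail */
    cs[i] |= 0x20;
}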


What am I doing wrong here?

Nothing. But the loop is so trivial that performance is bottlenecked by memory. Try commenting out all the calls to vpor in your code: the performance difference will be negligible, which suggests that the speed at which you can read and write memory is the limit.

Here is the critical loop for lower_all_chars_base emitted by gcc according to Compiler Explorer:

.L4:
    vpor    ymm0, ymm1, YMMWORD PTR [rdx]
    add     rdx, 32
    vmovdqu8    YMMWORD PTR [rdx-32], ymm0
    cmp     rcx, rdx
    jne     .L4

The only differences (apart from the AT&T assembler syntax) are that you have unrolled the loop nine times and are using aligned stores, neither of which improves performance in this case. Using unaligned store instructions on aligned memory doesn't incur any penalty, and neither does the branch, since it is easy to predict.

What is the proper AVX operation to do this?

Using intrinsics over inline assembly is very much preferable as it gives the compiler more opportunities to optimize the code.
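For example, a minimal intrinsics version of the same loop (a sketch under the question's assumptions, not the answer's benchmarked code), which the compiler is then free to unroll and schedule:

#include <immintrin.h>
#include <cstddef>
#include <string>

/* Sketch: the OR-with-0x20 transform written with AVX2 intrinsics. */
void lower_all_chars_intrin(std::string &str) {
  char *cs = str.data();  /* non-const data() requires C++17 or later */
  size_t len = str.size();
  const __m256i mask = _mm256_set1_epi8(0x20);
  size_t i = 0;
  for (; i + 32 <= len; i += 32) {
    __m256i v = _mm256_loadu_si256((const __m256i *)(cs + i));
    _mm256_storeu_si256((__m256i *)(cs + i), _mm256_or_si256(v, mask));
  }
  for (; i < len; i++)  /* scalar tail */
    cs[i] |= 0x20;
}

Compiled with -O3 -mavx2, this should produce a loop very similar to the one shown above.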

On my machine your code runs at 12.5GB/s which I believe is close to the maximum memory bandwidth for combined read/writes possible per core. So the only way to significantly improve performance would be to use more cores (threads).
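A rough sketch of that multi-threaded direction (an editor's illustration with an arbitrary thread count, not benchmarked here):

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

/* Sketch: split the buffer into per-thread slices and OR each slice with 0x20
   in parallel. Thread start-up cost only pays off for large inputs. */
void lower_all_chars_mt(std::string &str, unsigned nthreads = 4) {
  char *cs = str.data();
  size_t len = str.size();
  size_t chunk = len / nthreads;

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; t++) {
    size_t begin = t * chunk;
    size_t end   = (t == nthreads - 1) ? len : begin + chunk;
    workers.emplace_back([cs, begin, end] {
      for (size_t i = begin; i < end; i++)
        cs[i] |= 0x20;  /* the compiler may vectorize this inner loop */
    });
  }
  for (auto &w : workers) w.join();
}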

Björn Lindqvist