
I recently implemented a function in AVX2 assembly, but after finishing it I found that there is no performance improvement: the original pure C code costs about three hundred CPU cycles, and so does the AVX2 implementation. Why does this happen? Is there any room left to optimize my AVX2 implementation?

C code:

typedef struct {
  int32_t coeffs[N]  __attribute__((aligned(32)));
} poly;
typedef struct {
  poly vec[L];
} polyvecl;
typedef struct {
  poly vec[K];
} polyveck;
void prepare_s1_s2_table(uint64_t s_table[2*N], polyvecl *s1, polyveck *s2)
{
    uint32_t k,j;
    uint64_t temp;
    uint64_t mask_s = 0X0404040404040404;

    for(k=0; k<N; k++){
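        // Pack (ETA + coeff) for all L+K secret polynomials at index k into
        // one 64-bit word, one byte per coefficient (the s1 coefficients end
        // up in the most significant bytes).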
        for(j=0; j<L; j++)
        {
            temp = (uint64_t)(ETA + s1->vec[j].coeffs[k]);
            s_table[k+N] = (s_table[k+N]<<8) | (temp);
        }
        for(j=0; j<K; j++)
        {
            temp = (uint64_t)(ETA + s2->vec[j].coeffs[k]);
            s_table[k+N] = (s_table[k+N]<<8) | (temp);
        }
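    // Assuming each packed byte is at most 0x04, this per-byte subtraction
    // never borrows across bytes and yields (ETA - coeff) packed the same way.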
        s_table[k] = mask_s - s_table[k+N];  
    }

}

AVX2 code:

//preprocessor macro
#if defined(__WIN32__) || defined(__APPLE__)
#define cdecl(s) _##s
#else
#define cdecl(s) s
#endif

.macro prepares1s2 off,off2
# load coeffs
vpmovsxdq      (\off)(%rsi),%ymm0  
vpmovsxdq      (1024+\off)(%rsi),%ymm1
vpmovsxdq      (2048+\off)(%rsi),%ymm2
vpmovsxdq      (3072+\off)(%rsi),%ymm3
vpmovsxdq      (\off)(%rdx),%ymm4
vpmovsxdq      (1024+\off)(%rdx),%ymm5
vpmovsxdq      (2048+\off)(%rdx),%ymm6
vpmovsxdq      (3072+\off)(%rdx),%ymm7

# add eta s1
vpaddq       %ymm0,%ymm15,%ymm8
vpaddq       %ymm1,%ymm15,%ymm9
vpaddq       %ymm2,%ymm15,%ymm10
vpaddq       %ymm3,%ymm15,%ymm11

# pack s1 for s_table[i+N]
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm9,%ymm8
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm10,%ymm8
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm11,%ymm8
vpsllq       $8,%ymm8,%ymm8

# add eta s2
vpaddq       %ymm4,%ymm15,%ymm9
vpaddq       %ymm5,%ymm15,%ymm10
vpaddq       %ymm6,%ymm15,%ymm11
vpaddq       %ymm7,%ymm15,%ymm12

# pack s2 for s_table[i+N]
vpor         %ymm8,%ymm9,%ymm8
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm10,%ymm8
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm11,%ymm8
vpsllq       $8,%ymm8,%ymm8
vpor         %ymm8,%ymm12,%ymm8

# pack eta-s1 eta-s2 for s_table[i]  
vpsubq       %ymm8,%ymm14,%ymm0


# store
vmovdqa      %ymm0,(\off2)(%rdi)
vmovdqa      %ymm8,(2048+\off2)(%rdi)
.endm



.global cdecl(prepare_s1s2_table_avx)
cdecl(prepare_s1s2_table_avx):
.p2align 5   # align the following code on a 2^5 = 32-byte boundary

vpbroadcastq        _4xeta(%rip),%ymm15
vpbroadcastq        _4xmasks(%rip),%ymm14

prepares1s2       0,0
prepares1s2       16,32
prepares1s2       32,64
prepares1s2       48,96
prepares1s2       64,128
prepares1s2       80,160
prepares1s2       96,192
prepares1s2       112,224
prepares1s2       128,256
prepares1s2       144,288
prepares1s2       160,320
prepares1s2       176,352
prepares1s2       192,384
prepares1s2       208,416
prepares1s2       224,448
prepares1s2       240,480
prepares1s2       256,512
prepares1s2       272,544
prepares1s2       288,576
prepares1s2       304,608
prepares1s2       320,640
prepares1s2       336,672
prepares1s2       352,704
prepares1s2       368,736
prepares1s2       384,768
prepares1s2       400,800
prepares1s2       416,832
prepares1s2       432,864
prepares1s2       448,896
prepares1s2       464,928
prepares1s2       480,960
prepares1s2       496,992
prepares1s2       512,1024
prepares1s2       528,1056
prepares1s2       544,1088
prepares1s2       560,1120
prepares1s2       576,1152
prepares1s2       592,1184
prepares1s2       608,1216
prepares1s2       624,1248
prepares1s2       640,1280
prepares1s2       656,1312
prepares1s2       672,1344
prepares1s2       688,1376
prepares1s2       704,1408
prepares1s2       720,1440
prepares1s2       736,1472
prepares1s2       752,1504
prepares1s2       768,1536
prepares1s2       784,1568
prepares1s2       800,1600
prepares1s2       816,1632
prepares1s2       832,1664
prepares1s2       848,1696
prepares1s2       864,1728
prepares1s2       880,1760
prepares1s2       896,1792
prepares1s2       912,1824
prepares1s2       928,1856
prepares1s2       944,1888
prepares1s2       960,1920
prepares1s2       976,1952
prepares1s2       992,1984
prepares1s2       1008,2016



ret
anna
  • Put `.p2align 5` *before* the label so the NOPs don't have to execute. Also, use a loop; fully unrolling that much code is the opposite of helpful; the uop cache works very well. I'd have written this with intrinsics; compilers normally do a good job (a rough intrinsics sketch follows this comment thread). – Peter Cordes Mar 23 '23 at 16:30
  • Also, check performance counters for 4k-aliasing like `ld_blocks_partial.address_alias`. You're probably ok there; you load and then store with the same offsets, and if all your arrays are aligned the same relative to page boundaries, you won't have a problem. Are there page faults as part of this? That could be dominating the run-time. Or even just cache-misses perhaps? But unlikely, despite your strided access, L1d should be associative enough to get hits. What CPU did you test this on? – Peter Cordes Mar 23 '23 at 16:31
  • Does your intrinsics implementation bring a performance improvement? The CPU I tested this on is Rocket Lake. – anna Mar 24 '23 at 02:17
  • I wrote AVX2 intrinsics code too, but it is still slower than pure C. – anna Mar 24 '23 at 02:21
  • You haven't shown how you're benchmarking it. I expect this (or intrinsics) would be faster than scalar asm. But a compiler may already be auto-vectorizing your scalar C to asm like this, but with loops so it can hit in the uop cache. So I suspect that there's some benchmarking mistake, like testing this into newly-allocated memory with page faults, vs. testing pure C into memory that doesn't trigger page faults. – Peter Cordes Mar 24 '23 at 02:25
  • I changed the compiler flags from -O3 to -O1, and now the AVX2 version is faster than pure C. But now I have another question: since the compiler can optimize the C code, what's the meaning of me writing the AVX2 code? – anna Mar 24 '23 at 03:20
  • What do you mean "what's the meaning"? If the compiler can already auto-vectorize with a good strategy (efficient asm), there isn't any point in writing asm by hand. That's not always the case, and sometimes you do need intrinsics to get the compiler to make asm that isn't bad. Very rarely, it's worth actually writing asm by hand. Mostly that's useful for performance experiments while working on compilers, not for actual production code that you'll have to maintain in asm, since compilers usually do a good job when compiling intrinsics for x86-64. (Unlike sometimes for ARM 32-bit.) – Peter Cordes Mar 24 '23 at 03:40
  • By "what's the meaning" I meant "what's the point"; my English is not great, sorry~ – anna Mar 24 '23 at 06:22
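
Following the intrinsics suggestion in the comments above, a rough AVX2 intrinsics version of the same packing might look like the sketch below. It reuses the question's poly/polyvecl/polyveck types and the N, L, K, ETA macros, and assumes the layout the assembly implies (N = 256, L = K = 4, ETA small enough that ETA + coeff always fits in one byte); the function name is made up, and this is a sketch, not a drop-in replacement.

#include <immintrin.h>
#include <stdint.h>

void prepare_s1_s2_table_avx2(uint64_t s_table[2*N],
                              const polyvecl *s1, const polyveck *s2)
{
    const __m256i eta  = _mm256_set1_epi64x(ETA);
    const __m256i mask = _mm256_set1_epi64x(0x0404040404040404ULL);

    for (int k = 0; k < N; k += 4) {              // 4 coefficients per iteration
        __m256i acc = _mm256_setzero_si256();

        for (int j = 0; j < L; j++) {             // pack s1 first (most significant bytes)
            __m256i c = _mm256_cvtepi32_epi64(    // sign-extend 4 x int32 to int64
                _mm_load_si128((const __m128i *)&s1->vec[j].coeffs[k]));
            acc = _mm256_or_si256(_mm256_slli_epi64(acc, 8),
                                  _mm256_add_epi64(c, eta));
        }
        for (int j = 0; j < K; j++) {             // then pack s2
            __m256i c = _mm256_cvtepi32_epi64(
                _mm_load_si128((const __m128i *)&s2->vec[j].coeffs[k]));
            acc = _mm256_or_si256(_mm256_slli_epi64(acc, 8),
                                  _mm256_add_epi64(c, eta));
        }

        // Unaligned stores; switch to _mm256_store_si256 if s_table is known
        // to be 32-byte aligned (the vmovdqa stores in the asm assume it is).
        _mm256_storeu_si256((__m256i *)&s_table[k + N], acc);
        _mm256_storeu_si256((__m256i *)&s_table[k],
                            _mm256_sub_epi64(mask, acc));
    }
}

With -O3, a compiler may well turn the scalar C loop into something very similar on its own, which would explain seeing no difference against the hand-written assembly.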

1 Answer


It really depends on how you use the instructions and on what you are measuring. Vector instructions are not magic and do not bring free performance just by being used.

There are too many possible reasons for not seeing an improvement to list them all here. Among them:

  1. You underutilize the new ISA. Yes, SSE, AVX and AVX2 are all SIMD instruction sets, but are all the vector lanes doing useful work in your code? If not, the vector instructions are no better than equivalent scalar instructions.
  2. Some other part of your application has slowed down, which offsets the speedup from using AVX.
  3. Activating AVX is known to lower the processor's frequency. So even if your code's operations/cycle go up, the drop in frequency may still bring operations/second down, so that no net benefit is observed.

But without seeing the code or, even better, using a profiler, it is virtually impossible to make the right guess here.

Grigory Rechistov
  • Another common reason for no speedup is that you were already bottlenecked on memory bandwidth. Or page faults, depending on how this was benchmarked. AVX2 256-bit integer instructions won't cause much if any frequency slowdown on Skylake and newer, maybe not even on Haswell. See [SIMD instructions lowering CPU frequency](https://stackoverflow.com/a/56861355) - all the instructions in the question should be 256-bit "light", and thus only require L0, same as scalar or 128-bit integer. – Peter Cordes Mar 23 '23 at 16:33
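
To rule out the benchmarking issues raised in the comments (page faults on freshly allocated memory, cold caches dominating the measurement), one quick sanity check is to warm the buffers and run the function once before the timed loop. The harness below is a minimal sketch reusing the question's N, polyvecl, polyveck and prepare_s1_s2_table; the helper name and ITERATIONS are made up, and __rdtsc counts reference cycles rather than core clock cycles, so treat the result as a relative number for comparing the two implementations.

#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang)

#define ITERATIONS 1000

uint64_t bench_prepare_table(uint64_t s_table[2*N], polyvecl *s1, polyveck *s2)
{
    // Touch the output buffer and run the function once before timing, so
    // page faults and cold caches are paid outside the measured region.
    memset(s_table, 0, 2 * N * sizeof(uint64_t));
    prepare_s1_s2_table(s_table, s1, s2);

    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERATIONS; i++)
        prepare_s1_s2_table(s_table, s1, s2);
    uint64_t end = __rdtsc();

    return (end - start) / ITERATIONS;   // average reference cycles per call
}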