1

I have a NASM program that uses AVX-512. It runs very well on a single core, but with 4 cores it runs 10x slower than the single core version. Raising the core count does not make it run faster. I have profiled with VTune, Intel Advisor and Oprofile, and all three point to a single tight loop. My question is whether this loop could be avoided by sorting data at the end.

First the code for this loop, and then I'll explain what it does.

xor rbx,rbx
label_804D2:
; rdi=zmm7 index (byte offsets); rsi=zmm3 (data)
mov rcx,[rdi+rbx]
mov rax,[rsi+rbx]
mov [r8+rcx],rax
add rbx,8
cmp rbx,rdx
jl label_804D2

The loop takes 8-element arrays from two zmm registers and writes them to locations on the stack (pointed to by rsi and rdi) -- that step is because we can't access data elements directly from a zmm register in a loop because pextrq takes an immediate not a var.

Next we loop through the elements on both arrays. We write the integers in the second array to an 80-element qword array on the stack (pointed to by r8), using the data points in the first array as byte offsets to the locations in the 80-element array. For example:

   80 480 560 0 0 0 0 0
5776 5776 6724 0 0 0 0 0

Here we write 5776 to byte 80; we write 5776 to byte 480 and we write 6724 to byte 560. Before we enter the loop we know that we have only three elements to write in this 8-element array (that's in rdx). This small loop is enclosed by a larger loop; on each iteration of the larger loop we continue to distribute values into the 80-element array until we have distributed all values into an array that looks like this:

5476 5776 6084 6400 6724 0 0 0 0 0 5776 6084 6400 6724 7056 0 0 0 0 0 4624 4900 5184 5476 5776 0 0 0 0 0 36 64 100 144 196 0 0 0 0 0 7056 7396 7744 8100 8464 0 0 0 0 0 676 784 900 1024 1156 0 0 0 0 0 5776 6084 6400 6724 7056 0 0 0 0 0 6724 7056 7396 7744 8100 0 0 0 0 0

Intel Advisor flags this loop due to the mixed stride (writing to 80, then jumping to 480, then 560); VTune and oprofile also both flag this area as the major time consumer.

To fix this problem, I propose to record the data sequentially then use the data to sort the 80-element array at the end of the enclosing loop. To do that I need to sort both arrays in tandem.

One Stack Overflow question showed a case where sorting input data prior to processing speeded the program up substantially because it reduced branch mispredictions from many to a few (see Why is processing a sorted array faster than processing an unsorted array?). My problem is not branch prediction, it's the long irregular stride.

All of the arrays are 64-byte aligned for maximum cache efficiency.

I know this question could be met with "why don't you try it and see" but it will be significant work to redesign this, so before I begin I wanted to see if anyone has input on the soundness of this approach in this situation.

I did not post reproducible code because I don't expect anyone else to do the work first, but I hope others can share any experience with similar situations.

Thanks very much.

RTC222
  • 2,025
  • 1
  • 20
  • 53
  • Still worth a try, RTC222, isn't it? Do not hesitate to used Godbolt Explorer & call-graphs and other tools there, to support your claims & make them somehow reproducible : https://godbolt.org/z/s7bTrt – user3666197 Jul 03 '20 at 20:35
  • 3
    [Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/q/11227809) is *not* including the sorting time in the measured time, only saying that *if* you already have your data sorted, the linear scan gets faster. Sorting is pretty expensive for large arrays. But for small arrays a comparator sorting network can be somewhat fast. Unlikely to matter, though; out-of-order exec can overlap work across 8 loop iterations so memory-level parallelism can happen regardless of order. – Peter Cordes Jul 03 '20 at 21:10
  • 3
    If you have AVX512, why not just use a scatter instruction instead of doing it manually like this? – Peter Cordes Jul 03 '20 at 21:11
  • The problem I see with AVX512 scatter is that I will read 10 units of 8 qwords each (for a total of 80 qwords). Each set of 8 will be sorted, but the entire array will not be sorted. – RTC222 Jul 03 '20 at 21:32

0 Answers0