I have a NASM program that uses AVX-512. It runs very well on a single core, but with 4 cores it runs 10x slower than the single-core version, and raising the core count does not make it any faster. I have profiled with VTune, Intel Advisor and OProfile, and all three point to a single tight loop. My question is whether this loop could be avoided by recording the data sequentially and sorting at the end.
First the code for this loop, and then I'll explain what it does.
xor rbx,rbx            ; rbx = byte offset into both spill buffers
label_804D2:
; rdi -> spill of zmm7 (byte offsets); rsi -> spill of zmm3 (data)
mov rcx,[rdi+rbx]      ; rcx = byte offset into the 80-qword array
mov rax,[rsi+rbx]      ; rax = data value to store
mov [r8+rcx],rax       ; scatter the value to its offset
add rbx,8              ; advance to the next qword
cmp rbx,rdx            ; rdx = element count * 8 bytes
jl label_804D2
The loop reads 8-element arrays that were spilled from two zmm registers to the stack (the offsets from zmm7 at rdi, the data from zmm3 at rsi). The spill step is needed because we can't index into a zmm register with a variable inside a loop -- pextrq takes an immediate, not a variable.
Next we loop through the elements of both arrays, writing the integers from the second array into an 80-element qword array on the stack (pointed to by r8), using the values in the first array as byte offsets into that 80-element array. For example:
offsets: 80 480 560 0 0 0 0 0
values:  5776 5776 6724 0 0 0 0 0
Here we write 5776 to byte offset 80, 5776 to byte offset 480, and 6724 to byte offset 560. Before we enter the loop we know we have only three elements to write from this 8-element array; rdx holds that count scaled to bytes (3 * 8 = 24). This small loop is enclosed by a larger loop; on each iteration of the larger loop we continue to distribute values into the 80-element array until all values have been distributed and the array looks like this:
5476 5776 6084 6400 6724 0 0 0 0 0 5776 6084 6400 6724 7056 0 0 0 0 0 4624 4900 5184 5476 5776 0 0 0 0 0 36 64 100 144 196 0 0 0 0 0 7056 7396 7744 8100 8464 0 0 0 0 0 676 784 900 1024 1156 0 0 0 0 0 5776 6084 6400 6724 7056 0 0 0 0 0 6724 7056 7396 7744 8100 0 0 0 0 0
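For clarity, here is a minimal C sketch of what the inner loop does (the names `scatter`, `offsets`, `values`, and `dest` are illustrative, not from my real code):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* C equivalent of the NASM loop: for each recorded element, copy the
   value to the byte offset given by the matching index element.
   `offsets` and `values` mirror the stack spills of zmm7 and zmm3;
   `dest` is the 80-qword array; `nbytes` plays the role of rdx. */
static void scatter(const uint64_t *offsets, const uint64_t *values,
                    uint8_t *dest, size_t nbytes)
{
    for (size_t b = 0; b < nbytes; b += 8) {
        size_t i = b / 8;
        /* irregular-stride store: dest+80, then dest+480, then dest+560 */
        memcpy(dest + offsets[i], &values[i], sizeof(uint64_t));
    }
}
```

With offsets {80, 480, 560}, values {5776, 5776, 6724}, and nbytes = 24, this lands 5776 in qword 10, 5776 in qword 60, and 6724 in qword 70 of the destination.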
Intel Advisor flags this loop for its mixed stride (writing to byte 80, then jumping to 480, then 560); VTune and OProfile also flag this area as the major time consumer.
To fix this problem, I propose to record the data sequentially and then use the recorded offsets to sort the 80-element array into place at the end of the enclosing loop. To do that I need to sort both arrays in tandem, keeping each value paired with its offset.
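The tandem sort I have in mind would look roughly like this in C (a sketch under my assumptions; `pair_t`, `cmp_offset`, and `sorted_scatter` are hypothetical names, not part of my real code): record each (offset, value) pair sequentially, sort the pairs by offset, then store the values in ascending-offset order.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One recorded store: where the value should land, and the value itself. */
typedef struct {
    uint64_t offset;  /* byte offset into the destination array */
    uint64_t value;
} pair_t;

static int cmp_offset(const void *a, const void *b)
{
    uint64_t oa = ((const pair_t *)a)->offset;
    uint64_t ob = ((const pair_t *)b)->offset;
    return (oa > ob) - (oa < ob);  /* avoids overflow of subtraction */
}

/* Sort the pairs by offset, then write the values in ascending-offset
   order, turning the irregular-stride scatter into mostly forward,
   sequential stores. */
static void sorted_scatter(pair_t *pairs, size_t n, uint8_t *dest)
{
    qsort(pairs, n, sizeof(pair_t), cmp_offset);
    for (size_t i = 0; i < n; i++)
        memcpy(dest + pairs[i].offset, &pairs[i].value, sizeof(uint64_t));
}
```

Packing the offset and value into one struct keeps them together through the sort, which is what I mean by sorting both arrays in tandem.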
One Stack Overflow question showed a case where sorting input data before processing sped the program up substantially because it reduced branch mispredictions from many to few (see Why is processing a sorted array faster than processing an unsorted array?). My problem is not branch prediction, though; it's the long, irregular stride.
All of the arrays are 64-byte aligned for maximum cache efficiency.
I know this question could be met with "why don't you try it and see", but redesigning this will be significant work, so before I begin I wanted to ask whether this approach is sound in this situation.
I did not post a reproducible example because I'm not asking anyone to do the work for me, but I hope others can share their experience with similar situations.
Thanks very much.