Optimizing SIMD histogram calculation

Question

I worked on a code that implements an histogram calculation given an opencv struct IplImage * and a buffer unsigned int * to the histogram. I'm still new to SIMD so I might not be taking advantage of the full potential the instruction set provides.

histogramASM:
xor rdx, rdx
xor rax, rax
mov eax, dword [imgPtr + imgWidthOffset]
mov edx, dword [imgPtr + imgHeightOffset]
mul rdx                                     
mov rdx, rax                                ; rdx = Image Size
mov r10, qword [imgPtr + imgDataOffset]     ; r10 = ImgData

NextPacket: 
mov rax, rdx
movdqu  xmm0, [r10 + rax - 16]
mov rcx,16                               ; 16 pixels/paq

PacketLoop:
pextrb  rbx, xmm0, 0                ; saving the pixel value on rbx
shl rbx,2
inc dword [rbx + Hist]
psrldq  xmm0,1
loop    PacketLoop

sub rdx,16
cmp rdx,0
jnz NextPacket
ret

On C, I'd be running these piece of code to obtain the same result.

imgSize = (img->width)*(img->height);
pixelData = (unsigned char *) img->imageData;

for(i = 0; i < imgSize; i++)
{
    pixel = *pixelData; 
    hist[pixel]++;
    pixelData++;
}

But the time it takes for both, measured in my computer with rdtsc(), is only 1.5 times better SIMD's assembler. Is there a way to optimize the code above and quickly fill the histogram vector with SIMD? Thanks in advance

That doesn't look like good use of simd, I am surprised you even got that much improvement. As for optimizing, at the very least, you can merge the `shl rbx, 2` into `inc dword [rbx*4 + Hist]` and you don't need the `cmp rdx, 0` either. The `loop` instruction should also be avoided. — Jester, Jun 21 '15 at 23:12
Thanks. Just modified that. I'm having a hard time also because I'm not that experienced with the assembler way. But it has to be on asm. — notsag2d, Jun 21 '15 at 23:16
Related: tiny histograms (like 4 buckets) can use `count[0] += (arr[i] == 0)` which you can vectorize with SIMD packed compares - [Micro Optimization of a 4-bucket histogram of a large array or list](https://stackoverflow.com/q/61122144) — Peter Cordes, Apr 10 '20 at 18:26

Ross Ridge · Accepted Answer · 2015-06-22T00:02:52.257

Like Jester I'm surprised that your SIMD code had any significant improvement. Did you compile the C code with optimization turned on?

The one additional suggestion I can make is to unroll your Packetloop loop. This is a fairly simple optimization and reduces the number of instructions per "iteration" to just two:

pextrb  ebx, xmm0, 0
inc dword [ebx * 4 + Hist]
pextrb  ebx, xmm0, 1
inc dword [ebx * 4 + Hist]
pextrb  ebx, xmm0, 2
inc dword [ebx * 4 + Hist]
...
pextrb  ebx, xmm0, 15
inc dword [ebx * 4 + Hist]

If you're using NASM you can use the %rep directive to save some typing:

%assign pixel 0
%rep 16
    pextrb  rbx, xmm0, pixel
    inc dword [rbx * 4 + Hist]
%assign pixel pixel + 1
%endrep

Whoa thank you so much. This actually improved it to almost 5 times better. The loop instruction must be clearly avoided. EDIT: Just achieved over 5 times better removing the mov rcx, 16 also. — notsag2d, Jun 22 '15 at 00:07

Optimizing SIMD histogram calculation

1 Answers1

Linked