I worked on a code that implements an histogram calculation given an opencv struct IplImage * and a buffer unsigned int * to the histogram. I'm still new to SIMD so I might not be taking advantage of the full potential the instruction set provides.
histogramASM:
xor rdx, rdx
xor rax, rax
mov eax, dword [imgPtr + imgWidthOffset]
mov edx, dword [imgPtr + imgHeightOffset]
mul rdx
mov rdx, rax ; rdx = Image Size
mov r10, qword [imgPtr + imgDataOffset] ; r10 = ImgData
NextPacket:
mov rax, rdx
movdqu xmm0, [r10 + rax - 16]
mov rcx,16 ; 16 pixels/paq
PacketLoop:
pextrb rbx, xmm0, 0 ; saving the pixel value on rbx
shl rbx,2
inc dword [rbx + Hist]
psrldq xmm0,1
loop PacketLoop
sub rdx,16
cmp rdx,0
jnz NextPacket
ret
On C, I'd be running these piece of code to obtain the same result.
imgSize = (img->width)*(img->height);
pixelData = (unsigned char *) img->imageData;
for(i = 0; i < imgSize; i++)
{
pixel = *pixelData;
hist[pixel]++;
pixelData++;
}
But the time it takes for both, measured in my computer with rdtsc(), is only 1.5 times better SIMD's assembler. Is there a way to optimize the code above and quickly fill the histogram vector with SIMD? Thanks in advance