how do I optimize this loop further?

Question

Well, this post is the reason for Why is memcpy() and memmove() faster than pointer increments?

I have a custom image structure with 4 channels and I like to extract 3 of them (RGB) into an openCV cv::MAt strcuture with 3 channels.

cv::Mat GetColorImgFromFrame(Image* pimg)
{
    int iHeight = pimg->m_imageInfo.m_height;
    int iWidth = pimg->m_imageInfo.m_width;

    cv::Mat cvmColorImg = cv::Mat::zeros(iHeight, iWidth, CV_8UC3);
    int iDestStep = cvmColorImg.channels(); // 3

    uchar* pucSrc = pimg->m_data;    
    uchar* pucDest = cvmColorImg.data;
    uchar* pucDestLimit = cvmColorImg.data + iHeight*iWidth*iDestStep;

    for (;pucDest < pucDestLimit;)
    {       
        *(pucDest++) = *(pucSrc++);
        *(pucDest++) = *(pucSrc++);
        *(pucDest++) = *(pucSrc++);
        pucSrc++;
    }

    return cvmColorImg;
}

It doesn't seem much helpful to replace the three inner assignments with memcpy(pucDest, pucSrc, 3). Any suggestions?

Your code has syntax errors in C. I suggest you do not try to write multi-language source files: it is really hard! — pmg, Oct 15 '11 at 09:01
Just curious, why do you use a `for` loop, but reduce it to what is essentially a `while` loop? — Peter, Oct 15 '11 at 09:03
@Peter: well, I was messing around with the loop. It was originally a for loop but became what it is right now after several iterations.:) — wanderer, Oct 15 '11 at 09:12
@Martin: No because the src is a 4-channel matrix but the dest is 3-channel. — wanderer, Oct 15 '11 at 09:15
@codehacker: the `::` has no meaning in `C`. I believe they are a `C++` construct. To create a multi-language source file, you can't use them ... or `&` in function parameters, or `<<` meaning something other than "left shift", ... — pmg, Oct 15 '11 at 09:16
Surprised that the mandatory question has not been asked: did you determine that this is actually a performance bottleneck? It seems to me that your average OpenCV operation is going to be a lot slower than just this copying some bytes around. — Thomas, Oct 15 '11 at 11:22

score 3 · Answer 1 · answered Oct 15 '11 at 09:11

On a 32 bit machine, you could try to copy 4 bytes at once per loop from source to destination using unsigned int *. The trick is to keep pucDest an uchar pointer incremented by 3. You have deal with the the last word as a special case, not to violate the array boundary of the destination array for the last byte.

Since copying an int is typically not slower than copying a char on most machines, I suspect this will give you roughly a speed increase of a factor 3.

score 3 · Accepted Answer · edited May 23 '17 at 11:47

3

Before going into "assembler like" mode by using SIMD intrinsic, you can try to tell your compiler that the source and destination pointer do not alias, using for example the __restrict__ keyword with gcc or __restrict with visual studio

edited May 23 '17 at 11:47

Community

1
1

answered Oct 15 '11 at 09:11

shodanex

14,975
11
57
91

David Hammen · Answer 3 · 2011-10-15T11:17:17.137

1

Expanding on Doc Brown's answer, unfold the loop so that you are writing three uint32_t integers for every four read (uint64_t will take even bigger gulps but will be a bit harder to deal with).

Reading and writing ints may be problematic if your pimg->m_data and cvmColorImg.data do not start on a uint32_t boundary. You can force this to be the case by putting an int or unsigned int data member before these char* data members. Or you can use something akin to Duff's device. Personal preference: I would just force the alignment. It makes the copy code cleaner, easier, and perhaps faster. A few wasted bytes for padding is a tiny cost.

As is typically the case for unrolled loops, you'll have to do something special to handle the end. (Duff's device handles the oddity at the start, so no go with Duff's device here.) A lot of the problems with the end of loop processing will go away if you pad the end of the output buffer so that it occupies 4*N bytes.

edited Oct 15 '11 at 11:17

answered Oct 15 '11 at 11:11

David Hammen

32,454
9
60
108

+1 for the 4-byte-alignment remark, though that will only apply for the source array (the destination will not be accessed aligned when moving a pointer 3 bytes). I suspect most modern C++ compilers can do loop unrolling as optimization for you, so no need to do this on your own. – Doc Brown Oct 15 '11 at 11:32
The destination can be forced to be access aligned by putting a `int pad_before_data;` data member right before the data member that serves as the destination. Putting a `int pad_after_data;` data member right after the destination means one can copy `int`s throughout. Strictly speaking, writing into the compiler generated pad between the end of some element and the start of the next is UB, but I've never seen that UB do anything but the expected behavior (it writes into the compiler-generated pad). – David Hammen Oct 15 '11 at 13:08

how do I optimize this loop further?

3 Answers3