I am trying to beat the `memcpy` function by writing a NEON-intrinsics version of the same copy. Below is my logic:

uint8_t* m_input;  // size: 400 x 300
uint8_t* m_output; // size: 400 x 300
// not showing the complete code base for memory creation

memcpy(m_output, m_input, sizeof(m_output[0]) * 300 * 400);

Neon:

int32_t ht_index, wd_index;
uint8x16_t vector8x16_image;

for (int32_t htI = 0; htI < m_roiHeight; htI++) {
    ht_index = htI * m_roiWidth;

    for (int32_t wdI = 0; wdI < m_roiWidth; wdI += 16) {
        wd_index = ht_index + wdI;
        // load from the current offset, not always from the start of the buffer
        vector8x16_image = vld1q_u8(&m_input[wd_index]);

        vst1q_u8(&m_output[wd_index], vector8x16_image);
    }
}

I verified these results multiple times on i.MX6 hardware.

Results:

memcpy: 0.039 ms
NEON memcpy: 0.02841 ms

I read somewhere that without preload (`PLD`) instructions we cannot beat memcpy.

If that is true, how is my code giving these results? Is it right or wrong?

user3476225
    I have to wonder if your compiler vendor of choice has not already supplied you with a specialized version of memcpy specific to your platform. Also, yes, if you look online for any amount of time, you should find ARM memcpy() functions that properly use `PLD` to speed things up. – Michael Dorgan May 07 '15 at 21:38
  • Read "What Every Programmer Should Know About Memory" (available free) to see how these kinds of tests are done scientifically. Skip the parts too deep, look at the graphs and try to imitate them. Later on you'll develop more understanding of the subject and can read deeper parts. – auselen May 08 '15 at 07:32
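The `PLD` suggestion from the comment above can be sketched with `__builtin_prefetch`, which GCC/Clang typically lower to `PLD` on ARMv7. This is an illustrative sketch, not a tuned i.MX6 routine: the function name and the 64-byte prefetch distance are made up, and the NEON path is only compiled where `__ARM_NEON` is defined, with a portable fallback so the structure works elsewhere too.

```c
/* Sketch: block copy with software prefetch. On ARMv7, __builtin_prefetch
 * is typically lowered to PLD; the prefetch distance (64 bytes here) is an
 * illustrative tuning number, not a measured optimum. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

static void copy_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __builtin_prefetch(&src[i + 64]);   /* hint upcoming line into cache */
#ifdef __ARM_NEON
        vst1q_u8(&dst[i], vld1q_u8(&src[i]));   /* 16-byte NEON copy */
#else
        memcpy(&dst[i], &src[i], 16);           /* portable fallback */
#endif
    }
    for (; i < n; i++)                      /* scalar tail for n % 16 != 0 */
        dst[i] = src[i];
}
```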

1 Answer

If correctly written, a non-NEON memcpy() should be able to saturate the available memory bandwidth on your device, but for smaller transfers (fitting entirely within L1 or L2 cache) things can be different. Your test probably fits within the L2 cache.

Unfortunately, memcpy has to work for calls of any size, so it can't reasonably optimise for the in-cache and out-of-cache cases at the same time as optimising for very short copies, where the cost of detecting which optimisation would be best becomes the dominant factor.

Even so, it's possible that your test isn't fair. You have to be sure that both implementations aren't subject to different cache preconditions or different virtual page layout.

Make sure neither test is run entirely before the other. Test some of one implementation, then test some of the other, then back to the first and back to the second a few times, to make sure they're not subject to any warm-up conditions. And use the same buffers for both to ensure that there's no characteristic of different parts of your virtual address space that harms one implementation only.

Also, there are cases your memcpy doesn't handle, but these shouldn't matter much for large transfers.
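One such case is overlapping source and destination: memcpy itself doesn't have to support overlap, but a fully general copy (like memmove) must pick a copy direction; a fixed 16-byte stride also leaves a scalar tail whenever the length isn't a multiple of 16. A minimal, scalar-only sketch of the direction check (the function name is illustrative):

```c
/* Sketch of memmove-style overlap handling: copy forwards when dst is
 * below src, backwards when dst is above it, so overlapping bytes are
 * read before they are overwritten. Illustrative, scalar only. */
#include <stddef.h>
#include <stdint.h>

static void move_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (dst < src) {
        for (size_t i = 0; i < n; i++)      /* forward copy */
            dst[i] = src[i];
    } else if (dst > src) {
        for (size_t i = n; i-- > 0; )       /* backward copy */
            dst[i] = src[i];
    }
    /* dst == src: nothing to do */
}
```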

sh1