I am trying to beat the "memcpy" function by writing the neon intrinsics for the same . Below is my logic :
uint8_t* m_input; //Size as 400 x300
uint8_t* m_output; //Size as 400 x300
//not mentioning the complete code base for memory creat
memcpy(m_output, m_input, sizeof(m_output[0]) * 300* 400);
Neon:
int32_t ht_index,wd_index;
uint8x16_t vector8x16_image;
for(int32_t htI =0;htI < m_roiHeight;htI++){
ht_index = htI * m_roiWidth ;
for(int32_t wdI = 0;wdI < m_roiWidth;wdI+=16){
wd_index = ht_index + wdI;
vector8x16_image = vld1q_u8(m_input);
vst1q_u8(&m_output[wd_index],vector8x16_image);
}
}
I verified multiple times these result on imx6 hardware.
Results:
Memcpy :0.039 milisec
neon memcpy: 0.02841 milisec
I READ SOMEWHERE THAT WITHOUT PRELOADED INSTRUCTIONS WE CAN NOT BEAT MEMCPY.
If it is true then how my code is giving these results . Is it right or wrong