I believe code like this is common in C++:
for(size_t i=0;i<ARRAY_SIZE;++i)
A[i]=B[i]*C[i];
One commonly advocated alternative is:
double *pA=A, *pB=B, *pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
*pA++=(*pB++)*(*pC++);
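To make the comparison concrete, here is a minimal self-contained sketch of both variants written as functions, so they can be compiled and timed side by side. The function names `multiply_indexed` and `multiply_pointers` and the explicit length parameter `n` (standing in for `ARRAY_SIZE`) are my own choices for illustration.

```cpp
#include <cstddef>

// Index-based variant: A[i] = B[i] * C[i] for every element.
void multiply_indexed(double* A, const double* B, const double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        A[i] = B[i] * C[i];
}

// Pointer-increment variant: same result, but walks three pointers
// instead of indexing. Note that each pointer needs its own '*' in
// the declaration.
void multiply_pointers(double* A, const double* B, const double* C, std::size_t n) {
    double* pA = A;
    const double* pB = B;
    const double* pC = C;
    for (std::size_t i = 0; i < n; ++i)
        *pA++ = (*pB++) * (*pC++);
}
```

Both functions should produce identical output for the same inputs; whether one is actually faster than the other is exactly what I am unsure about.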
What I am wondering is the best way to improve this code. IMO the following things need to be considered:
- CPU cache. How do CPUs fill their caches, and what access pattern gives the best hit rate?
- SSE. I suppose SIMD instructions such as SSE could speed this up?
- Parallelization. What if the loop is parallelized, e.g. with OpenMP? In that case the pointer trick may not be usable.
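Here is roughly what I have in mind for the SSE and OpenMP points, as a sketch rather than a tuned implementation. It assumes an x86 target with SSE2 and uses unaligned loads so it makes no assumption about how the arrays are aligned. The function names `multiply_sse2` and `multiply_openmp` are hypothetical. The OpenMP version keeps the index form, because each thread has to find its own elements from `i`, and a set of shared incrementing pointers would not work there.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// SSE2 sketch: multiplies two doubles per iteration.
// Unaligned loads/stores are used, so no alignment is assumed.
void multiply_sse2(double* A, const double* B, const double* C, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d b = _mm_loadu_pd(B + i);         // load B[i], B[i+1]
        __m128d c = _mm_loadu_pd(C + i);         // load C[i], C[i+1]
        _mm_storeu_pd(A + i, _mm_mul_pd(b, c));  // store the two products
    }
    for (; i < n; ++i)  // scalar tail when n is odd
        A[i] = B[i] * C[i];
}

// OpenMP sketch: iterations are independent, so the loop can be split
// across threads. A signed loop variable is used because older OpenMP
// versions require one. Without OpenMP enabled at compile time the
// pragma is ignored and the loop runs serially.
void multiply_openmp(double* A, const double* B, const double* C, std::size_t n) {
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < count; ++i)
        A[i] = B[i] * C[i];
}
```

With GCC or Clang the OpenMP version would need `-fopenmp` to actually run in parallel. I do not know whether the hand-written SSE2 loop beats what the compiler already generates for the plain loop with optimizations on.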
Any suggestions would be appreciated!