I have two functions, one which calculates the difference between successive elements of a row and the second calculates the successive difference between values in a column. Therefore one would calculate M[i][j+1] -M[i][j]
and second would do M[i+1][j] - M[i][j]
, M
being the matrix. I implement them as follows -
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i=0; i < M; i++){
for(int j=0; j <=N - 33; j+=32){
auto pos = i*N + j;
_mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
}
}
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i = 0; i < M-1; i++){
//#pragma prefetch input : (i+1)*N : (i+1)*N + N
for(int j = 0; j <N-33; j+=32){
auto idx = i * N + j;
auto idx_1 = (i+1)*N + j;
_mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
}
}
However, Benchmarking them, Average runtimes for the first and second function are as follows -
firstFunction = 21.1432ms
secondFunction = 166.851ms
Where the size of matrix is M = 9024
and N = 12032
This is a huge increase in the runtime for a similar operation. I suspect this has something to do with memory accesses and caching, where way more cycles are spent in getting the memory from another row in the second case.
So my question is two-part.
- Is my reasoning for the difference in runtimes correct.
- How do I alleviate it. My first idea is to prefetch the second row in the memory and go ahead, but I am not able to prefetch a dynamically calculated position. Would
_mm_prefetch
help if the issue is indeed of the memory access times
I am using the dpcpp
compiler. with compile options as -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread
. This compiler has a pragma prefetch
but it does not allow runtime calculated prefetches. However, I would really appreciate something which is not specific to the compiler and it could also be spefic to GCC.
Edit1 - Just tried _mm_prefetch
, but that too throws error: argument to 'error: argument to '__builtin_prefetch' must be a constant integer _mm_prefetch(input + (i+1) * N, N);
. So an additional question, is there any way we can prefetch runtime calculated memory locations ?
TIA