
I have two functions: one calculates the difference between successive elements of a row, and the second calculates the difference between successive values in a column. That is, the first computes M[i][j+1] - M[i][j] and the second computes M[i+1][j] - M[i][j], M being the matrix.
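
For clarity, a plain scalar sketch of what the two functions compute (assuming the matrix is stored row-major with `unsigned char` elements; just to pin down the indexing):

#include <cstddef>  // size_t

// Row-wise successive difference: out[i][j] = M[i][j+1] - M[i][j]
void rowDiffScalar(const unsigned char* in, unsigned char* out, size_t M, size_t N){
    for(size_t i = 0; i < M; i++)
        for(size_t j = 0; j + 1 < N; j++)
            out[i*N + j] = in[i*N + j + 1] - in[i*N + j];
}

// Column-wise successive difference: out[i][j] = M[i+1][j] - M[i][j]
void colDiffScalar(const unsigned char* in, unsigned char* out, size_t M, size_t N){
    for(size_t i = 0; i + 1 < M; i++)
        for(size_t j = 0; j < N; j++)
            out[i*N + j] = in[(i+1)*N + j] - in[i*N + j];
}

I implement them as follows -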

#include <immintrin.h>       // AVX2 / AVX-512VL+BW intrinsics
typedef unsigned char uchar; // assumption: uchar is unsigned char

// Row-wise difference: output[i][j] = input[i][j+1] - input[i][j]
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M; i++){
        for(int j = 0; j <= N - 33; j += 32){
            auto pos = i*N + j;
            _mm256_storeu_epi8(output + pos,
                _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1),
                                _mm256_loadu_epi8(input + pos)));
        }
    }
}
// Column-wise difference: output[i][j] = input[i+1][j] - input[i][j]
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
//#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for(int j = 0; j < N-33; j += 32){
            auto idx   = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx,
                _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1),
                                _mm256_loadu_epi8(input + idx)));
        }
    }
}

However, benchmarking them gives the following average runtimes:
firstFunction = 21.1432ms
secondFunction = 166.851ms

The matrix size is M = 9024 and N = 12032.

This is a huge increase in runtime for a very similar operation. I suspect it has something to do with memory access and caching: in the second case, far more cycles are spent fetching data from another row.

So my question has two parts:

  1. Is my reasoning for the difference in runtimes correct?
  2. How do I alleviate it? My first idea is to prefetch the next row while processing the current one, but I am not able to prefetch a dynamically calculated position. Would `_mm_prefetch` help if the issue is indeed one of memory access latency?

I am using the dpcpp compiler with the compile options `-g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread`. This compiler has a prefetch pragma, but it does not allow runtime-calculated prefetch addresses. However, I would really appreciate something that is not specific to this compiler; it could also be specific to GCC.

Edit1 - I just tried `_mm_prefetch`, but that too throws `error: argument to '__builtin_prefetch' must be a constant integer` for `_mm_prefetch(input + (i+1) * N, N);`. So an additional question: is there any way to prefetch runtime-calculated memory locations?
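
For reference, a minimal sketch of a call that should compile (my own illustration; the function name is made up): the address passed to `_mm_prefetch` may be computed at runtime, and it is only the second argument, the locality hint, that must be a compile-time constant such as `_MM_HINT_T0`. The error above appears to come from passing `N` as that hint. Note that a single prefetch only pulls in one cache line, so this shows the syntax rather than a complete fix:

#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0 (also pulled in by <immintrin.h>)

void secondFunctionWithPrefetch(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
        // Runtime-computed address is fine; only the hint must be a constant.
        _mm_prefetch(reinterpret_cast<const char*>(input + (i+1)*N), _MM_HINT_T0);
        for(int j = 0; j < N-33; j += 32){
            auto idx   = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx,
                _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1),
                                _mm256_loadu_epi8(input + idx)));
        }
    }
}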

TIA

Atharva Dubey
  • I suspect conflict eviction from cache. `N = 12032` is `0x2F00`. Try making N in your array `N = 12064` (`0x2F20`) to see if the results are better? (Just don't use these extra elements.) – Alex Guteniev Dec 04 '21 at 13:25
  • `_mm_prefetch` takes one pointer; the second arg is `_MM_HINT_T0` for you, see https://stackoverflow.com/q/46521694/2945027 – Alex Guteniev Dec 04 '21 at 13:28
  • @AlexGuteniev, my knowledge is rather nascent in this domain. Could you please elaborate or point me to a resource about conflict eviction? Also, I would try with 12064, but this code will go somewhere else, where the matrix dimensions will not be in my hands, so would padding the matrix not nullify the benefits from SIMD? – Atharva Dubey Dec 04 '21 at 13:46
  • You could try some sort of blocking, e.g. `for(int jj=0; jj…` (see the blocking sketch after the comments) – chtz Dec 04 '21 at 13:46
  • From https://www.agner.org/optimize/ get *Optimizing software in C++* (optimizing_cpp.pdf) and look into chapters _9.2 Cache organization_ and _9.10 Cache contentions in large data structures_; they specifically discuss caches and matrices. – Alex Guteniev Dec 04 '21 at 13:52
  • @AlexGuteniev, sure I will go through it, thanks so much – Atharva Dubey Dec 04 '21 at 13:56
  • @chtz, thanks for your response. Could you please elaborate a little more, especially about the order of your operations and why that increment by 256? – Atharva Dubey Dec 04 '21 at 13:58
  • Try disabling the optimization and check whether this helps: instead of -O3, use the -Od flag. – Noorjahan - Intel Feb 18 '22 at 10:15
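
A rough sketch of the blocking idea from chtz's comment (my own interpretation, not the original snippet; the function name and the BLOCK value are placeholders to tune):

#include <immintrin.h>
#include <algorithm> // std::min
#include <cstddef>   // size_t

// Process the matrix in vertical strips of BLOCK columns, so that the two rows
// touched by each strip stay resident in cache between outer-loop iterations.
constexpr size_t BLOCK = 256; // arbitrary strip width, a multiple of 32

void secondFunctionBlocked(uchar* input, uchar* output, size_t M, size_t N){
    for(size_t jj = 0; jj < N; jj += BLOCK){
        const size_t jend = std::min(jj + BLOCK, N);
        for(size_t i = 0; i + 1 < M; i++){
            for(size_t j = jj; j + 32 <= jend; j += 32){
                const size_t idx   = i * N + j;
                const size_t idx_1 = (i + 1) * N + j;
                _mm256_storeu_epi8(output + idx,
                    _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1),
                                    _mm256_loadu_epi8(input + idx)));
            }
        }
    }
}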

0 Answers