
Is there any way on CUDA 2.0 devices to disable the L1 cache for only one specific variable? I know that L1 caching can be disabled at compile time for all memory operations by passing the flag -Xptxas -dlcm=cg to nvcc. However, I want to disable the cache only for reads of one specific global variable, so that all other memory reads go through the L1 cache.

Based on a search I have done on the web, a possible solution is through PTX assembly code.

artless noise
zeus2

3 Answers


As mentioned above, you can use inline PTX. Here is an example:

__device__ __inline__ double ld_gbl_cg(const double *addr) {
  double return_value;
  asm("ld.global.cg.f64 %0, [%1];" : "=d"(return_value) : "l"(addr));
  return return_value;
}

You can easily vary this by swapping .f64 for .f32 (float), .s32 (int), etc., and the return_value constraint "=d" for "=f" (float), "=r" (int), etc. Note that the last constraint before (addr), "l", denotes 64-bit addressing; if you are using 32-bit addressing, it should be "r".
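For illustration, here is the float variant built by exactly those substitutions, together with a hypothetical kernel showing how it might be used (the kernel and parameter names are assumptions, not from the original question):

```cuda
// Bypass L1 for a 32-bit float load: .f32 instruction, "=f" output constraint.
__device__ __inline__ float ld_gbl_cg_f32(const float *addr) {
  float return_value;
  asm("ld.global.cg.f32 %0, [%1];" : "=f"(return_value) : "l"(addr));
  return return_value;
}

// Hypothetical kernel: reads `uncached_data` around L1 (L2 only),
// while all other memory traffic is cached normally.
__global__ void scale(const float *uncached_data, float *out,
                      float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = ld_gbl_cg_f32(uncached_data + i) * factor;
}
```

Only loads routed through the wrapper use the .cg cache operator; every other access in the kernel still goes through L1.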

Vladislav Ivanishin
Reguj

Inline PTX can be used to load and store the variable. The ld.cg and st.cg instructions cache data only in L2. The cache operators are described in section 8.7.8.1 (Cache Operators) of the PTX ISA 2.3 document; the instructions of interest are ld and st. Inline PTX is described in Using Inline PTX Assembly in CUDA.
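The store side can be sketched the same way as the load wrapper above (this wrapper is an illustration under the same pattern, not code from the answer):

```cuda
// Store a double to global memory with the .cg cache operator,
// so the written line is cached in L2 only, not in L1.
__device__ __inline__ void st_gbl_cg(double *addr, double value) {
  asm("st.global.cg.f64 [%0], %1;" :: "l"(addr), "d"(value) : "memory");
}
```

The "memory" clobber tells the compiler the asm statement writes memory, preventing it from reordering surrounding loads and stores across the instruction.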

chappjc
Greg Smith

If you declare the variable to be volatile, then it will only be cached in the L2 cache on Fermi GPUs. Note that some compiler optimizations, such as removing repeated loads, are not performed on volatile variables because the compiler assumes they may be written by another thread.
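As a sketch of what this looks like in a kernel (names are hypothetical, and as the comments below note, this L2-only behavior is a Fermi implementation detail rather than a guarantee of the programming model):

```cuda
// The volatile qualifier forces every access to *flag to be a fresh load;
// on Fermi this had the side effect of keeping the data out of L1.
__global__ void poll_flag(volatile int *flag, int *out) {
  // The compiler will not hoist or eliminate this repeated load.
  while (*flag == 0) {
    /* spin until another thread or the host sets the flag */
  }
  *out = 1;
}
```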

Heatsink
    I don't think the programming model makes any representation about the cacheability of volatile variables. – ArchaeaSoftware Sep 23 '12 at 20:57
  • @Archaea The Fermi architecture makes caching of volatile data infeasible, due to the absence of cache coherence. Having run into errors in the CUDA documentation in the past, I don't consider CUDA's memory model documentation reliable. – Heatsink Sep 23 '12 at 23:23
  • I tried the solution with the volatile variable declaration and it did not work. It seems that the variable is cached again. – zeus2 Sep 24 '12 at 01:50
  • @Heatsink, what I meant is that no one should be refactoring their code to use the 'volatile' keyword to change the cacheability of the affected variables. Folks who want to direct the cacheability of loads and stores have to use inline PTX, as described in the answer above. – ArchaeaSoftware Sep 24 '12 at 05:16