
I just noticed that it is possible at all to have (CUDA kernel) memory accesses be uncached (see e.g. this answer here on SO).

Can this be done...

  • For a single kernel individually?
  • At run time rather than at compile time?
  • For writes only rather than for reads and writes?
einpoklum

2 Answers

  1. Only if you compile that kernel individually, because this is an instruction-level feature which is enabled during code generation. You could also use inline PTX assembler to issue ld.global.cg instructions for a particular load operation within a kernel [see here for details].
  2. No, it is an instruction-level feature of PTX. You can JIT a version of the code containing non-caching memory loads at runtime, but that is still technically compilation. You could probably use some template tricks and separate compilation to have the runtime hold two versions of the same code, built with and without caching, and choose between those versions at runtime. You could also use the same tricks to get two versions of a given kernel with or without inline PTX for uncached loads [see here for one possibility of achieving this].
  3. These non-caching instructions bypass the L1 cache, with byte-level granularity, to the L2 cache. So they are load only (all writes invalidate the L1 cache and store to L2).
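The two tricks above (inline PTX for an L1-bypassing load, and a templated kernel so the runtime holds both variants) can be sketched roughly as follows. This is my own illustration, not code from the answer; the function and kernel names (load_l2_only, scale, launch_scale) are made up for the example:

```cuda
#include <cuda_runtime.h>

// Point 1: issue an ld.global.cg load (cache in L2 only, bypass L1)
// for one particular load via inline PTX.
__device__ __forceinline__ float load_l2_only(const float* ptr)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(ptr));
    return v;
}

// Point 2: compile two versions of the same kernel via a template
// parameter, so both are available in the binary at runtime.
template <bool UNCACHED>
__global__ void scale(float* out, const float* in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * (UNCACHED ? load_l2_only(in + i) : in[i]);
}

// Host side: an ordinary runtime branch picks which compiled
// instantiation to launch.
void launch_scale(float* out, const float* in, float s, int n, bool uncached)
{
    dim3 block(256), grid((n + block.x - 1) / block.x);
    if (uncached)
        scale<true><<<grid, block>>>(out, in, s, n);
    else
        scale<false><<<grid, block>>>(out, in, s, n);
}
```

Note this is still compile-time code generation for both paths; only the choice between the two pre-built variants happens at runtime.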
talonmies
  • Wait, so you're saying that uncached means only un-L1-cached? That is, you can only bypass L1 cache? – einpoklum May 24 '15 at 16:19
  • You can read the PDF I linked to for yourself, but AFAIK the L2 cache isn't bypassable. – talonmies May 24 '15 at 17:49
  • Correct. The L2 cache cannot be bypassed. Also note that Kepler by default [bypasses L1 cache for global space accesses](http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#l1-cache), and [Maxwell L1 "functionality" has been combined with the texture cache](http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#l1-cache), i.e. removed from the physical resource that provides the shared memory in the SM. Like Kepler, Maxwell L1 global load caching may also be disabled by default (read the link). Kepler GK110B = Tesla K40/K80 – Robert Crovella May 24 '15 at 17:50
  • @einpoklum: A couple of other possibilities for doing this popped into my head (template tricks and inline PTX to get two versions of a kernel with and without caching loads). I've expanded the answer a bit with a couple of links. – talonmies May 25 '15 at 08:40
  • Your help and guidance are very much appreciated, with this question and others. – einpoklum May 25 '15 at 08:48

I don't know if it was possible before, but CUDA 8.0 lets you fine-tune caching for specific reads and writes. See the PTX manual for details.

For example, to make this code always go to the main memory on read:

const float4 val = input[i];

you could write the following:

float4 val;
const float4* myinput = input+i;
asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];" : "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w) : "l"(myinput));

I managed to speed up one of my cache-intensive kernels by about 20% using non-cached reads and writes for data that was, by design, accessed only once.
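For the write side, PTX also defines store cache operators; a hedged sketch of a counterpart to the load above, using st.global.cs (the "cache streaming" hint for data likely accessed only once). The helper name is mine, not from the answer, and which store qualifier performs best is workload-dependent:

```cuda
#include <cuda_runtime.h>

// Illustrative helper (name is made up): store a float4 with the
// st.global.cs cache operator, hinting that the written data is
// streaming and unlikely to be re-read soon by this SM.
__device__ __forceinline__ void store_streaming(float4* ptr, const float4& v)
{
    asm volatile("st.global.cs.v4.f32 [%0], {%1, %2, %3, %4};"
                 :: "l"(ptr), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
                 : "memory");
}
```

The "memory" clobber tells the compiler the asm statement writes memory, so it won't reorder or elide surrounding accesses across it.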

Dimaleks