I have a kernel which makes use of four different memories:

  • memory A (around 2MB) is used only once (for loading)
  • memory B (around 2MB) is used only once (for storing)
  • memory C (32KB) and memory D (32KB) are read-only and are accessed hundreds of times

Accesses to memory A and B coalesce, but accesses to C and D do not. They are not random, though: a huge contiguous chunk of A/B uses the same 32 bytes of C/D.

Memory A and B have a one-to-one correspondence, i.e. every 16 bytes of memory B is a modified copy of the corresponding 16 bytes of memory A.

Stores are presumably not cached and would not evict/invalidate existing cache lines if all the pointers are marked `__restrict__`, so memory B should not be a problem.

Memory C and D can be loaded using `__ldg` since they are used frequently and are never modified.

Memory A is used only once. Hence, there is simply no point in caching the loads from A. Unfortunately, by default, they are cached in L2. This might cause the useful cache lines containing C and D to be evicted.

How do I inform the compiler that I do not want loads from memory A to be cached?
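
For concreteness, here is a minimal sketch of the access pattern described above. It is an illustration under assumptions, not a confirmed fix: the kernel name `transform`, the helper `count_mismatches`, the arithmetic, and the `chunk` parameter are all hypothetical, and each chunk reads a single 4-byte entry of C and D rather than 32 bytes to keep it short. It uses the cache-hint intrinsic `__ldcs` for A, which compiles to `ld.global.cs`; the PTX documentation describes that as a streaming load allocated with an evict-first policy. That is only a hint to the hardware, so whether it actually protects the C/D lines would have to be checked with a profiler.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per 16-byte element of A/B.
__global__ void transform(const uint4* __restrict__ a, uint4* __restrict__ b,
                          const unsigned* __restrict__ c,
                          const unsigned* __restrict__ d,
                          size_t n, size_t chunk)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint4 v = __ldcs(&a[i]);     // streaming hint: this line of A is read once
    size_t k = i / chunk;        // a whole chunk of A/B shares one entry of C/D
    unsigned cv = __ldg(&c[k]);  // read-only data cache path for C
    unsigned dv = __ldg(&d[k]);  // and for D
    v.x = v.x * cv + dv;         // placeholder arithmetic (assumption)
    v.y = v.y * cv + dv;
    v.z = v.z * cv + dv;
    v.w = v.w * cv + dv;
    b[i] = v;                    // plain coalesced store to B
}

// Runs the kernel on generated data and returns the number of elements of B
// that differ from a host reference (-1 on a CUDA error).
int count_mismatches()
{
    const size_t n = 1 << 17;          // 131072 elements * 16 B = 2 MiB
    const size_t entries = 8192;       // 8192 entries * 4 B = 32 KiB
    const size_t chunk = n / entries;  // elements of A/B per entry of C/D

    std::vector<uint4> ha(n), hb(n);
    std::vector<unsigned> hc(entries), hd(entries);
    for (size_t i = 0; i < n; ++i) {
        unsigned u = (unsigned)i;
        ha[i] = make_uint4(u, u + 1, u + 2, u + 3);
    }
    for (size_t k = 0; k < entries; ++k) {
        hc[k] = 2 * (unsigned)k + 1;
        hd[k] = (unsigned)k;
    }

    uint4 *da = nullptr, *db = nullptr;
    unsigned *dc = nullptr, *dd = nullptr;
    cudaMalloc(&da, n * sizeof(uint4));
    cudaMalloc(&db, n * sizeof(uint4));
    cudaMalloc(&dc, entries * sizeof(unsigned));
    cudaMalloc(&dd, entries * sizeof(unsigned));
    cudaMemcpy(da, ha.data(), n * sizeof(uint4), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc.data(), entries * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dd, hd.data(), entries * sizeof(unsigned), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    transform<<<blocks, threads>>>(da, db, dc, dd, n, chunk);
    cudaError_t err = cudaMemcpy(hb.data(), db, n * sizeof(uint4),
                                 cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc); cudaFree(dd);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t k = i / chunk;
        uint4 e = ha[i];  // host reference using the same unsigned arithmetic
        if (hb[i].x != e.x * hc[k] + hd[k] || hb[i].y != e.y * hc[k] + hd[k] ||
            hb[i].z != e.z * hc[k] + hd[k] || hb[i].w != e.w * hc[k] + hd[k])
            ++bad;
    }
    return bad;
}
```

Calling `count_mismatches()` from a small `main` only confirms that the hinted loads return the same data as ordinary loads; it says nothing about cache behaviour. Comparing L2 hit rates for C/D with `__ldcs(&a[i])` versus a plain `a[i]` would be the actual experiment.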

Yashas
  • You have no ability to control caching in L2. The controls given to the CUDA programmer for caching affect L1. All reads and writes to device memory flow through the L2. To prevent caching in L1 you can use an [uncached load](https://stackoverflow.com/questions/30420774/making-some-but-not-all-cuda-memory-accesses-uncached). Your question is arguably a duplicate of that one. – Robert Crovella Mar 20 '20 at 13:40
  • I'm wondering: if memory C and D are loaded using `__ldg` (i.e. cached in the read-only cache), will caching A in L2 affect the loading performance of C and D? – Gnimuc Mar 21 '20 at 05:28
  • @Gnimuc C and D do not fully fit in L1. – Yashas Mar 21 '20 at 06:31
  • I thought they would just fit in the unified L1/texture cache on architectures like Turing and Volta, whose unified cache size can be up to 96K and 128K respectively (if no shared memory is used). – Gnimuc Mar 21 '20 at 06:56
