
I am developing a CUDA application for a GTX 580 with CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory-intensive than typical CUDA programs, and I am trying to allocate as much shared memory as possible to each CUDA block. However, the program crashes every time I try to use more than 32KB of shared memory per block.

From reading the official CUDA documentation, I learned that there is 48KB of on-die memory for each SM on a CUDA device with Compute Capability 2.0 or greater, and that the on-die memory is split between L1 cache and shared memory:

The same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call (Section F.4.1) http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf

This led me to suspect that only 32KB of on-die memory was allocated as shared memory when my program was running. Hence my question: is it possible to use all 48KB of on-die memory as shared memory?

I tried everything I could think of. I specified the option --ptxas-options="-v -dlcm=cg" for nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program (roughly as in the sketch after the compiler output below), but none of it resolved the issue. I even made sure that there was no register spilling and that I did not accidentally use local memory:

1>      24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]
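
For reference, a minimal sketch of what those cache-configuration calls can look like (the kernel name myKernel is a placeholder; the calls only express a preference for the 48KB-shared / 16KB-L1 split):

    #include <cstdio>

    __global__ void myKernel(/* ... */) { /* kernel body */ }

    int main()
    {
        // Ask for the 48KB shared / 16KB L1 split for the whole device...
        cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
        if (err != cudaSuccess)
            printf("cudaDeviceSetCacheConfig: %s\n", cudaGetErrorString(err));

        // ...and also for this particular kernel.
        err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        if (err != cudaSuccess)
            printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

        return 0;
    }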

Although I can live with 32KB of shared memory, which already gave me a huge performance boost, I would rather take full advantage of all of the fast on-die memory. Any help is much appreciated.

Update: I was launching 640 threads when the program crashed. 512 gave me a better performance than 256 did, so I tried to increase the number of threads further.

meriken2ch
  • How many threads are you launching? – pQB Sep 13 '12 at 08:48
  • I was launching 640 threads when the program crashed. 512 gave me a better performance than 256 did, so I tried to increase the number of threads further. – meriken2ch Sep 13 '12 at 08:59

3 Answers


Your problem is not related to the shared memory configuration but to the number of threads you are launching.

Using 63 registers per thread and launching 640 threads gives a total of 40320 registers. The register file of your device holds 32K registers per SM, so you are running out of resources (a rough check of this arithmetic against the device limits is sketched below).

The on-chip memory is well explained in Tom's answer, and, as he commented, checking the API calls for errors will help you catch such problems in the future.
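
A rough sanity check of that arithmetic, comparing the registers the launch needs against the per-block register limit reported by the runtime (a sketch; device index 0 is assumed):

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const int regsPerThread   = 63;   // from the ptxas -v output above
        const int threadsPerBlock = 640;  // the failing launch configuration
        const int regsNeeded      = regsPerThread * threadsPerBlock;  // 40320

        printf("registers needed: %d, available per block: %d\n",
               regsNeeded, prop.regsPerBlock);   // 32768 on a GTX 580
        if (regsNeeded > prop.regsPerBlock)
            printf("launch would fail: too many resources requested\n");
        return 0;
    }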

pQB

Devices of compute capability 2.0 and higher have 64KB of on-chip memory per SM. This is configurable as 16KB L1 and 48KB smem or 48KB L1 and 16KB smem (also 32/32 on compute capability 3.x).

Your program is crashing for another reason. Are you checking all API calls for errors? Have you tried cuda-memcheck?

If you use too much shared memory, you will get an error when you launch the kernel saying that there were insufficient resources (a minimal error-checking sketch follows).
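
A minimal error-checking sketch (kernel name and launch configuration are placeholders): cudaGetLastError() reports launch failures such as running out of resources, and cudaDeviceSynchronize() surfaces errors from the kernel execution itself.

    #include <cstdio>

    __global__ void myKernel() { /* kernel body */ }

    int main()
    {
        myKernel<<<1, 640>>>();

        cudaError_t err = cudaGetLastError();   // did the launch itself fail?
        if (err != cudaSuccess)
            printf("launch error: %s\n", cudaGetErrorString(err));

        err = cudaDeviceSynchronize();          // did the kernel fail while running?
        if (err != cudaSuccess)
            printf("execution error: %s\n", cudaGetErrorString(err));

        return 0;
    }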

Tom
  • Thank you for the correction. I was checking API calls for errors, and none was detected. I will try cuda-memcheck to see if there is anything suspicious. – meriken2ch Sep 13 '12 at 09:04
  • Actually, I just found that I was missing an error. Thank you! – meriken2ch Sep 13 '12 at 09:12

Also, passing parameters from the host to the GPU uses shared memory (up to 256 bytes), so you will never get the full 48KB.

JPM
  • Not in this case: the device in question is a compute 2.x device, and kernel arguments are passed in constant memory, not shared memory. – talonmies Sep 20 '12 at 17:18