
I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0.

I know that for performance we need to access memory efficiently, and use registers and shared memory on the device cleverly.

However, I don't understand how to calculate the number of registers available per thread, how much shared memory a single block can use, or other such simple but important quantities for a particular kernel configuration.

I want to understand this through an explicit example. Incidentally, I am currently trying to write a particle code, in which one of the kernels should look like this.

Each block is a 1-D collection of threads, and each grid is a 1-D collection of blocks.

  • Number of blocks : 16384
  • Number of threads per block : 32 ( => total threads 32*16384 = 524288)
  • Each thread block is given a 32 x 32 two-dimensional integer array in shared memory to work with.
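Roughly, the kernel skeleton I have in mind looks like this (a compilable sketch only; the kernel name and the body are placeholders, not my actual particle code):

```cuda
#include <cstdio>

__global__ void particleKernel(double *out)
{
    // 32 x 32 int tile per block: 32 * 32 * 4 = 4096 bytes of shared
    // memory, well under the 48 KB per-block limit from deviceQuery.
    __shared__ int tile[32][32];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x][0] = tid;   // placeholder use of the tile
    __syncthreads();

    // Per-thread scalars that I would like to keep in registers:
    double a = tile[threadIdx.x][0] * 0.5;
    double b = a * a;
    out[tid] = a + b;
}

int main(void)
{
    double *d_out;
    cudaMalloc(&d_out, 16384 * 32 * sizeof(double));
    particleKernel<<<16384, 32>>>(d_out);   // 16384 blocks x 32 threads each
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```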

Within a thread I would like to store some numbers of type double. But I am not sure how many such doubles I can store without any registers spilling into local memory (which resides in off-chip device memory). Can someone tell me how many doubles can be stored per thread for this kernel configuration?

Also, is the above-mentioned shared-memory configuration valid for each of my blocks?

A sample computation showing how one would go about deducing these things would be very illustrative and helpful.

Here is the information about my GTX 570: (using deviceQuery from CUDA-SDK)

[deviceQuery] starting...
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

    Device 0: "GeForce GTX 570"
      CUDA Driver Version / Runtime Version          4.0 / 4.0
      CUDA Capability Major/Minor version number:    2.0
      Total amount of global memory:                 1279 MBytes (1341325312 bytes)
      (15) Multiprocessors x (32) CUDA Cores/MP:     480 CUDA Cores
      GPU Clock Speed:                               1.46 GHz
      Memory Clock rate:                             1900.00 Mhz
      Memory Bus Width:                              320-bit
      L2 Cache Size:                                 655360 bytes
      Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
      Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 32768
      Warp size:                                     32
      Maximum number of threads per block:           1024
      Maximum sizes of each dimension of a block:    1024 x 1024 x 64
      Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and execution:                 Yes with 1 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Concurrent kernel execution:                   Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support enabled:                No
      Device is using TCC driver mode:               No
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Bus ID / PCI location ID:           2 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
    [deviceQuery] test results...
    PASSED

    Press ENTER to exit...
curiousexplorer
  • What do you mean by "Store"? In scalar variables (which will go in registers), or in an array (which will go in off-chip local memory)? – harrism Sep 12 '12 at 06:07
  • I meant either scalar variables or an array. From your comment, it seems that if I store 8 scalar variables in a thread they will be stored in registers, but if I store an array of 8 doubles then the array will be stored in off-chip local memory? This seems very strange. Can you clarify? – curiousexplorer Sep 12 '12 at 12:35
  • [This answer](http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable/12169588#12169588) should make it clear. – harrism Sep 13 '12 at 01:33

1 Answer


So, the kernel configuration question is a little complicated, and you should use the CUDA Occupancy Calculator. You also have to study how warps work.

Once a block is assigned to an SM, it is further divided into 32-thread units called warps. A warp is the unit of thread scheduling in an SM, so we can calculate the number of warps that reside on an SM for a given block size and a given number of blocks assigned to each SM. In your case a warp consists of 32 threads, so a block with 256 threads contains 8 warps.

Choosing a good kernel configuration depends on your data and operations. Remember that you want to occupy each SM as fully as possible: reach the full thread capacity of each SM and keep the maximum number of warps available for scheduling around long-latency operations. Another important thing is not to exceed the limit on threads per block, in your case 1024.

FacundoGFlores
  • CC 2.x devices can have 8 blocks per SM. CC 3.x devices can have 16 blocks per SM. If you only have 32 threads per block you will hit device limits early. You can increase your threads per block and shared memory per block to get around this issue. This will require that you change your indexing calculation both per thread and for shared memory, but you will likely see a big performance improvement. – Greg Smith Sep 12 '12 at 02:45
  • Doubles require 2 adjacent 32-bit registers. For a GTX 570 (CC 2.0) each SM has 32K registers. With 32 threads per block you will be limited by blocks per device to 256 threads. You will only be using 1/2 the available registers. Depending on how much control code you have, you will be able to fit 20-30 doubles in the RF, assuming 63 registers per thread. If you follow @facunvd link to the calculator and set the device to 2.0 you should see this limit. You will also see why I suggest increasing threads per block. – Greg Smith Sep 12 '12 at 02:49
  • @GregSmith, just to make sure I understand: are you saying my answer is right? If so, you are saying that we must worry not only about how many warps we can run on an SM, but also about the number of blocks per SM? – FacundoGFlores Sep 12 '12 at 02:51
  • Yes, each device has a limit of blocks per SM. If you launch only 1 warp per block then you will limit yourself on CC 2.0 to 8 warps per SM, or 16.6% occupancy (8/48), which is not sufficient to hide latency. Furthermore, at this occupancy you can't even fully utilize the register file, as this requires (32K / MAX_REGISTERS_PER_THREAD) at least 512 threads (16 warps). – Greg Smith Sep 12 '12 at 02:55
  • @GregSmith "...fit 20-30 doubles in the RF assuming 63 registers per thread". What does `RF` mean here? – curiousexplorer Sep 12 '12 at 13:56
  • @curiousexplorer Sorry for using an abbreviation. Registers are allocated from the register file (RF). You'll also see this called the local register file (LRF) in some documents. – Greg Smith Sep 12 '12 at 15:16