
Suppose I have an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block. What are some pros and cons, in terms of access speed and memory usage, if I store these values in:

  1. thread-local memory: double x[3] = {1,2,3};

  2. shared memory: __shared__ double x[3] = {1,2,3};

  3. numeric literals: directly hardcode these values in the expression where they appear

Thanks!


1 Answer

TL;DR

use __constant__ double x[3]; // ... initialization ...
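
A minimal sketch of what that looks like in full (the kernel and host code here are illustrative, not from the question): a __constant__ variable must be declared at file scope, outside any function, and one way to initialize it is from the host with cudaMemcpyToSymbol before the kernel launch.

    // File scope: __constant__ cannot be declared inside a function.
    __constant__ double x[3];

    __global__ void kernel(double *out)
    {
        // All threads read the same addresses, so the reads are
        // served by the constant cache.
        out[threadIdx.x] = x[0] + x[1] + x[2];
    }

    int main()
    {
        const double h_x[3] = {1, 2, 3};
        // Initialize constant memory from the host before launch.
        cudaMemcpyToSymbol(x, h_x, sizeof(h_x));
        // ... allocate out, launch the kernel, etc.
    }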

First, know where a variable actually resides

In your question:

  1. thread-local memory: double x[3] = {1,2,3};

This is imprecise. Depending on how your code accesses x[], it can reside in either registers or local memory.

Since there is no type qualifier, the compiler will try its best to put the variable in registers:

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory,

but when it can't, it will place the variable in local memory:

  • Arrays for which it cannot determine that they are indexed with constant quantities,

  • Large structures or arrays that would consume too much register space,

  • Any variable if the kernel uses more registers than available (this is also known as register spilling).
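
To make the first bullet concrete, here is a hedged sketch (the kernel names are invented for illustration) of how the indexing pattern decides the outcome:

    __global__ void stays_in_registers(double *out)
    {
        double x[3] = {1, 2, 3};
        // Every index is a compile-time constant, so the compiler
        // can promote the array elements to registers.
        out[threadIdx.x] = x[0] + x[1] + x[2];
    }

    __global__ void likely_spills(const int *idx, double *out)
    {
        double x[3] = {1, 2, 3};
        // The index is only known at run time, so the compiler
        // cannot map each element to a fixed register; the array
        // will likely be placed in (slow) local memory.
        out[threadIdx.x] = x[idx[threadIdx.x] % 3];
    }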

You really don't want x to be in local memory; it's slow. In your situation,

an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block

both __constant__ and __shared__ can be good choices.

For a complete description of this topic, see CUDA Toolkit Documentation: variable-type-qualifiers.

Then, consider speed & availability

Hardcode

The numbers will be embedded in the instruction stream, so you may expect some performance improvement. Benchmark your program before and after doing this, though.
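
As an illustrative sketch (not from the question) of what hardcoding looks like:

    __global__ void hardcoded(const double *in, double *out)
    {
        double v = in[threadIdx.x];
        // The literals are folded into the instruction stream (or the
        // compiler's constant bank); no array storage is needed at all.
        out[threadIdx.x] = 1.0 + 2.0 * v + 3.0 * v * v;
    }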

Register

Registers are fast but scarce. Consider a block of 16x16 threads: with a maximum of 64K registers per block, each thread can use up to 256 registers. (Well, maybe not that scarce; 256 should be enough for most kernels.)

Local Memory

Local memory is slow, but it is plentiful: a thread can use up to 512 KB of it.

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...

Shared Memory

Shared memory is fast but scarce: typically 48 KB per block (less than the registers!).

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
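
Note that a __shared__ array cannot carry an initializer (which is why option 2 in the question does not compile as written); a common pattern, sketched below, is to have one thread write the values and then synchronize:

    __global__ void kernel(double *out)
    {
        __shared__ double x[3];  // no "= {1,2,3}" allowed here
        if (threadIdx.x == 0) {
            // One thread fills in the fixed values...
            x[0] = 1; x[1] = 2; x[2] = 3;
        }
        __syncthreads();  // ...and all threads wait before reading.
        out[threadIdx.x] = x[0] + x[1] + x[2];
    }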

Constant Memory

Constant memory is fast in a different way (see below): its speed depends heavily on the constant cache, and that cache is scarce, typically 8 KB ~ 10 KB per multiprocessor.

The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
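
In other words, constant memory is at its best when all threads in a warp read the same address, as in this hedged sketch (hypothetical kernels):

    __constant__ double x[3];

    __global__ void uniform_read(double *out)
    {
        // Every thread in the warp reads x[0]: a single request,
        // served at full constant-cache throughput.
        out[threadIdx.x] = x[0];
    }

    __global__ void divergent_read(double *out)
    {
        // Threads read different addresses: the request is split
        // into separate requests and serviced serially.
        out[threadIdx.x] = x[threadIdx.x % 3];
    }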

Read more: CUDA Toolkit Documentation: device-memory-accesses

lazyplayer
    Your very first recommendation is illegal in CUDA. Did you actually try compiling that piece of code? – Robert Crovella Apr 07 '17 at 21:37
  • @RobertCrovella oops, I didn't. I currently don't have a CUDA enabled machine in hand. Will try moving the initialization code to main() later. – lazyplayer Apr 07 '17 at 23:41
  • Using registers as you mentioned requires a first read from some location. Is it parameter passing? – Florent DUGUET Apr 08 '17 at 09:10
  • @FlorentDUGUET I read through my answer and failed to find something like 'first read from some location'; can you explain more? By the way, registers are used to store local variables and temporary results, and are managed by the compiler. They may first read the original data from global memory or shared memory, for example. – lazyplayer Apr 08 '17 at 15:02
  • There is no such thing as a block with 64x64 threads. That is illegal in CUDA. – Robert Crovella Apr 09 '17 at 19:36
  • @RobertCrovella Fixed. – lazyplayer Apr 10 '17 at 03:01