
I'm going to great lengths to try to store frequently accessed data in tile_static memory, to take advantage of the boundless performance nirvana that will ensue.

However, I've just read that only certain hardware/drivers can actually dynamically index tile_static arrays, and that the operation might just spill over to global memory anyway.

In an ideal world I'd just do it and profile, but this is turning out to be a major operation and I'd like to get an indication as to whether or not I'm wasting my time here:

tile_static int staticArray[128];
int resultFast = staticArray[0]; // constant index: this is super fast

// but what about this:
int i = ...; // dynamically derived value!
int resultNotSoFast = staticArray[i]; // is this faster than getting it from global memory?
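
For reference, here's a minimal sketch of the complete tiled kernel that snippet would live in. The wrapper function, and the arithmetic deriving i, are just illustrative stand-ins for my real code, and it assumes data.extent is a multiple of the tile size:

#include <amp.h>
using namespace concurrency;

void myKernel(array_view<int, 1> data)  // illustrative wrapper
{
    // assumes data.extent is a multiple of the 128-thread tile size
    parallel_for_each(data.extent.tile<128>(),
        [=](tiled_index<128> idx) restrict(amp)
    {
        tile_static int staticArray[128];
        staticArray[idx.local[0]] = data[idx.global]; // each thread fills one slot
        idx.barrier.wait();                           // make the whole tile visible

        int i = (idx.local[0] * 7) % 128;             // stand-in for the real dynamic index
        data[idx.global] = staticArray[i];            // the access in question
    });
}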

How can I find out whether my GPU/driver supports dynamic indexing of static arrays?

quant
  • Where did you read this? It will help me answer the question. – Ade Miller Nov 11 '13 at 16:16
  • @AdeMiller I read it here: http://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/ specifically in regards to *shared memory*: `There are 32 threads in a warp and exactly 32 shared memory banks. Because each bank services only one request per cycle, multiple simultaneous accesses to the same bank will result in what is known as a bank conflict. This will be discussed further in the next post.` – quant Nov 12 '13 at 05:19
  • @AdeMiller Sorry I might have gotten things muddled up. Honestly I'm quite confused by all these new terms. I think you're right - I was talking about *local* memory and that article on *shared memory* is something different. – quant Nov 12 '13 at 05:22
  • No. This is hard. Effectively the tradeoff you make with GPU programming is you get to think about and implement a lot of the things that the various levels of cache give you on a CPU. In return you can get better performance. I added another section to my answer for you. – Ade Miller Nov 12 '13 at 06:12

1 Answer


Dynamic Indexing of Local Memory

So I did some digging on this because I wanted to understand it too. I assume you are referring to dynamic indexing of local memory, not tile_static memory (which in CUDA parlance is "shared memory"). In that case, staticArray in your example above would be declared as:

int staticArray[128]; // not tile_static

This cannot be dynamically indexed, because an array declared as int staticArray[128] is actually stored in 128 registers, and registers cannot be addressed dynamically; the compiler will typically spill such an array to slow, off-chip local memory instead. Allocating large arrays like this is problematic anyway because it uses up a large number of registers, which are a limited resource on the GPU. Use too many registers per thread and your application will be unable to exploit all the available parallelism, because fewer threads can be resident on each processor at a time.
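
To make that concrete, here's a hedged sketch (the wrapper function and names are illustrative, not from your code) of a per-thread array in a C++ AMP kernel. Whether staticArray actually stays in registers is up to the compiler and driver; the dynamic read at the end is the kind of access that typically forces a spill:

#include <amp.h>
using namespace concurrency;

void localArrayDemo(array_view<int, 1> data)  // illustrative wrapper
{
    parallel_for_each(data.extent,
        [=](index<1> idx) restrict(amp)
    {
        int staticArray[128];                // per-thread, *not* tile_static
        for (int j = 0; j < 128; ++j)
            staticArray[j] = j * 2;          // regular pattern the compiler can unroll

        int i = data[idx] & 127;             // dynamically derived index (0..127)
        data[idx] = staticArray[i];          // registers can't be indexed dynamically,
                                             // so expect a spill to off-chip memory
    });
}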

In the case of C++ AMP, the level of abstraction provided by DX11 may make this somewhat irrelevant, but I'm not enough of a DX11 expert to know.

There's a great explanation of this here: In a CUDA kernel, how do I store an array in "local thread memory"?

Bank Conflicts

Tile static memory is divided into a number of modules referred to as banks. Tile static memory typically consists of 16, 32, or 64 banks, each of which is 32 bits wide. This is specific to the particular GPU hardware and might change in the future. Tile static memory is interleaved across these banks. This means that, for a GPU whose tile static memory is implemented with 32 banks, if arr is an array<float, 1>, then arr[1] and arr[33] are in the same bank, because each float occupies a single 32-bit bank location. This is the key point to understand when it comes to dealing with bank conflicts.
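
As a quick sanity check of that mapping (assuming the 32-bank, 32-bit-word layout described above), the bank an element lands in is just its index modulo the bank count:

#include <cstdio>

// Assuming 32 banks, each one 32-bit word wide:
// element i of a float array lives in bank i % 32.
int bankOf(int index) { return index % 32; }

int main()
{
    std::printf("arr[1]  -> bank %d\n", bankOf(1));   // bank 1
    std::printf("arr[33] -> bank %d\n", bankOf(33));  // bank 1 again: same bank
    return 0;
}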

Each bank can service one address per cycle. For best performance, threads in a warp should either access data in different banks or all read the same word from a single bank (a broadcast, which the hardware handles efficiently). When these access patterns are followed, your application can maximize the available tile static memory bandwidth. In the worst case, multiple threads in the same warp access different addresses in the same bank. This causes those accesses to be serialized, which can result in a significant degradation in performance.
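
To illustrate both ends of that spectrum, here's a sketch of a tiled kernel (assuming 32 banks, and that threads in a warp vary idx.local[1] fastest; tileData and the wrapper are illustrative names):

#include <amp.h>
using namespace concurrency;

void bankDemo(array_view<float, 2> data)  // illustrative wrapper
{
    parallel_for_each(data.extent.tile<32, 32>(),
        [=](tiled_index<32, 32> idx) restrict(amp)
    {
        tile_static float tileData[32][32];
        tileData[idx.local[0]][idx.local[1]] = data[idx.global];
        idx.barrier.wait();

        // Conflict-free: within a warp idx.local[1] varies, so the
        // column index differs per thread and each access hits a
        // different bank.
        float fast = tileData[idx.local[0]][idx.local[1]];

        // 32-way conflict: the column index idx.local[0] is the same
        // for every thread in the warp, so all 32 accesses land in
        // the same bank and are serialized.
        float slow = tileData[idx.local[1]][idx.local[0]];

        data[idx.global] = fast + slow;
    });
}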

I think the key point of confusion (based on some of your other questions) might be that a memory bank is 32 bits wide but is responsible for access to all of the memory within the bank, which will be 1/16, 1/32, or 1/64 of the total tile static memory.

You can read more about bank conflicts here: What is a bank conflict? (Doing Cuda/OpenCL programming)

Ade Miller