Dynamic Indexing of Local Memory
So I did some digging on this because I wanted to understand this too.
I assume you are referring to dynamic indexing of local memory, not tile_static memory (in CUDA parlance, "shared memory"). In that case, in your example above, staticArray should be declared as:
int staticArray[128]; // not tile_static
This array cannot be dynamically indexed because int staticArray[128] is actually stored in 128 registers, and registers cannot be addressed dynamically.

Allocating large arrays like this is problematic anyway because it uses up a large number of registers, which are a limited resource on the GPU. Use too many registers per thread and your application will be unable to exploit all the available parallelism, because some threads will stall waiting for registers to become available.
In the case of C++ AMP, the level of abstraction provided by DX11 may make this somewhat irrelevant; I'm not enough of an expert on DX11 to say.
There's a great explanation of this here: In a CUDA kernel, how do I store an array in "local thread memory"?
Bank Conflicts
Tile static memory is divided into a number of modules referred to as banks. It typically consists of 16, 32, or 64 banks, each of which is 32 bits wide. This is specific to the particular GPU hardware and might change in the future. Tile static memory is interleaved across these banks. This means that for a GPU whose tile static memory is implemented with 32 banks, if arr is an array<float, 1>, then arr[1] and arr[33] are in the same bank, because each float occupies a single 32-bit bank location. This is the key point to understand when it comes to dealing with bank conflicts.
Each bank can service one address per cycle. For best performance, threads in a warp should either access data in different banks or all read the same address in a single bank, a broadcast pattern that the hardware handles efficiently. When these access patterns are followed, your application can maximize the available tile static memory bandwidth. In the worst case, multiple threads in the same warp access different addresses in the same bank. These accesses are serialized, which can result in a significant degradation in performance.
I think the key point of confusion (based on some of your other questions) might be that a memory bank is 32 bits wide but is responsible for access to all the memory mapped to it, which is 1/16, 1/32, or 1/64 of the total tile static memory.
You can read more about bank conflicts here: What is a bank conflict? (Doing Cuda/OpenCL programming)