
I am learning about CUDA optimizations. I found a presentation at this link: Optimizing CUDA by Paulius Micikevicius.

In this presentation, the section "MAXIMIZE GLOBAL MEMORY BANDWIDTH" says that global memory coalescing will improve the bandwidth.

My question: how do you calculate the global memory bandwidth? Can anyone explain it with a simple program example?

veda
  • http://stackoverflow.com/questions/7876006/how-to-calculate-the-achieved-bandwidth-of-a-cuda-kernel ? – pQB Nov 02 '11 at 10:01

1 Answer


Theoretical bandwidth can be calculated from the hardware specifications.

For example, the NVIDIA GeForce GTX 280 uses DDR RAM with a memory clock rate of 1,107 MHz and a 512-bit wide memory interface. Using these data items, the peak theoretical memory bandwidth of the NVIDIA GeForce GTX 280 is 141.6 GB/sec:

( 1107 x 10^6 x ( 512 / 8 ) x 2 ) / 10^9 = 141.6 GB/sec

In this calculation, the memory clock rate is converted to Hz, multiplied by the interface width (divided by 8 to convert bits to bytes), and multiplied by 2 because of the double data rate. Finally, the product is divided by 10^9 to convert the result to GB/sec (GBps).
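The steps described above can be sketched as a small host-side C++ function. This is only an illustration: the function name `theoretical_bandwidth_gbps` and its parameter names are made up for this example, and the default data rate of 2 assumes double data rate memory as in the GTX 280 case.

```cpp
#include <cassert>

// Peak theoretical memory bandwidth in GB/sec (1 GB = 10^9 bytes).
// Hypothetical helper; names are illustrative, not from any CUDA API.
//   memory_clock_mhz : memory clock rate in MHz
//   bus_width_bits   : memory interface width in bits
//   data_rate        : transfers per clock (2 for double data rate memory)
double theoretical_bandwidth_gbps(double memory_clock_mhz,
                                  int bus_width_bits,
                                  int data_rate = 2) {
    assert(memory_clock_mhz > 0.0 && bus_width_bits > 0 && data_rate > 0);
    const double clock_hz = memory_clock_mhz * 1.0e6;         // MHz -> Hz
    const double bytes_per_transfer = bus_width_bits / 8.0;   // bits -> bytes
    return clock_hz * bytes_per_transfer * data_rate / 1.0e9; // bytes/sec -> GB/sec
}
```

Calling `theoretical_bandwidth_gbps(1107.0, 512)` with the GTX 280 figures evaluates to 141.696, which the text above quotes as 141.6 GB/sec.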

Effective bandwidth is calculated by timing specific program activities and by knowing how data is accessed by the program. To do so, use this equation:

Effective bandwidth = (( Br + Bw ) / 10^9 ) / time

Here, the effective bandwidth is in units of GBps, Br is the number of bytes read per kernel, Bw is the number of bytes written per kernel, and time is given in seconds.
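As a rough sketch of the equation above, the following host-side C++ helpers compute the effective bandwidth from byte counts and an elapsed time. The names `effective_bandwidth_gbps` and `copy_kernel_bandwidth_gbps` are hypothetical, and the elapsed time is assumed to have been measured separately (for instance with CUDA events, whose elapsed time is reported in milliseconds and would need converting to seconds). The copy-kernel case assumes a kernel that reads each `float` once and writes it once.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth = ((Br + Bw) / 10^9) / time, in GB/sec.
// Hypothetical helper; bytes_read is Br, bytes_written is Bw,
// and seconds is the measured kernel time in seconds.
double effective_bandwidth_gbps(std::size_t bytes_read,
                                std::size_t bytes_written,
                                double seconds) {
    assert(seconds > 0.0);
    const double total_bytes = static_cast<double>(bytes_read) +
                               static_cast<double>(bytes_written);
    return (total_bytes / 1.0e9) / seconds;
}

// Assumed example: a simple copy kernel that reads n_floats values
// and writes n_floats values, so Br == Bw == n_floats * sizeof(float).
double copy_kernel_bandwidth_gbps(std::size_t n_floats, double seconds) {
    const std::size_t bytes = n_floats * sizeof(float);
    return effective_bandwidth_gbps(bytes, bytes, seconds);
}
```

For example, if copying a 2048 x 2048 array of floats took a (made-up) 1 millisecond, `copy_kernel_bandwidth_gbps(2048 * 2048, 0.001)` would give roughly 33.55 GB/sec, which can then be compared against the theoretical peak of the card.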

More information is available in the CUDA C Best Practices Guide.

SKPS
Yappie
  • I'm so sorry. I would have to use the Google formula... It's just a conversion to Hz – Yappie Dec 11 '11 at 22:36
  • Phrases about maximizing bandwidth usually mean "while your bandwidth is less than the theoretical value, your code's performance is limited by the calculation". Your goal is to achieve the theoretical bandwidth or get near it – Yappie Dec 11 '11 at 22:46
  • Not really. If your code is not limited by the bandwidth, it might as well be limited by memory latency and not calculations. As you noted, it is the goal to be either compute bound, or memory bandwidth bound, or both! – angainor Sep 13 '12 at 21:06