
How do I calculate GPU memory bandwidth, given:

  1. data sample size (in GB).
  2. kernel execution time (nvprof output).

GPU: gtx 1050 ti
Cuda: 8.0
OS: Windows 10
IDE: Visual studio 2015

Normally I would use this formula: bandwidth [GB/s] = data_size [GB] / average_time [s].

But when I apply the formula above to the get_mem_kernel() kernel I get a result that looks wrong: 441.93 [GB/s].

I consider this result wrong because the tech specs for the GTX 1050 Ti state that the global memory bandwidth is 112 [GB/s].
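
For reference, plugging the nvprof averages reported below into the formula:

get_mem_kernel: 0.1 [GB] / 226.28e-6 [s] ≈ 441.93 [GB/s]
set_mem_kernel: 0.1 [GB] / 972.54e-6 [s] ≈ 102.82 [GB/s]

The set kernel lands just below the spec, as expected; only the get kernel appears to exceed it.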

Where did I make a mistake or is there something else that I do not understand?

Sample code:

// cpp libs:
#include <iostream>
#include <sstream>
#include <fstream>
#include <iomanip>
#include <stdexcept>

// cuda libs:
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define ERROR_CHECK(CHECK_) if (CHECK_ != cudaError_t::cudaSuccess) { std::cout << "cuda error" << std::endl; throw std::runtime_error("cuda error"); }

using data_type = double;

// ceiling division for positive integers (used to compute the grid size):
template <typename T> constexpr __forceinline__
T div_s(T dividend, T divisor)
{
    using P = double;
    return static_cast <T> (static_cast <P> (dividend + divisor - 1) / static_cast <P> (divisor));
}

__global__
void set_mem_kernel(const unsigned int size, data_type * const in_data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        in_data[idx] = static_cast <data_type> (idx);
    }
}

__global__
void get_mem_kernel(const unsigned int size, data_type * const in_data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data_type val = 0;
    if (idx < size)
    {
        val = in_data[idx];
    }
}

struct quit_program
{
public:
    ~quit_program()
    {
        try
        {
            ERROR_CHECK(cudaDeviceReset());
        }
        catch (...) {}
    }
} quit;

int main()
{
    unsigned int size = 12500000; // 12.5 million doubles = 100 MB
    size_t       byte = size * sizeof(data_type);

    dim3 threads (256, 1, 1);
    dim3 blocks  (div_s(size, threads.x), 1, 1);

    std::cout << size << std::endl;
    std::cout << byte << std::endl;
    std::cout << std::endl;

    std::cout << threads.x << std::endl;
    std::cout << blocks.x  << std::endl;
    std::cout << std::endl;

    // data:
    data_type * d_data = nullptr;
    ERROR_CHECK(cudaMalloc(&d_data, byte));

    for (int i = 0; i < 20000; i++)
    {
        set_mem_kernel <<<blocks, threads>>> (size, d_data);
        ERROR_CHECK(cudaDeviceSynchronize());
        ERROR_CHECK(cudaGetLastError());

        get_mem_kernel <<<blocks, threads>>> (size, d_data);
        ERROR_CHECK(cudaDeviceSynchronize());
        ERROR_CHECK(cudaGetLastError());
    }

    // Exit:
    ERROR_CHECK(cudaFree(d_data));
    ERROR_CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}

nvprof result:

D:\Dev\visual_studio\nevada_test_site\x64\Release>nvprof ./cuda_test.exe
12500000
100000000

256
48829

==10508== NVPROF is profiling process 10508, command: ./cuda_test.exe
==10508== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
==10508== Profiling application: ./cuda_test.exe
==10508== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 81.12%  19.4508s     20000  972.54us  971.22us  978.32us  set_mem_kernel(unsigned int, double*)
 18.88%  4.52568s     20000  226.28us  224.45us  271.14us  get_mem_kernel(unsigned int, double*)

==10508== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 97.53%  26.8907s     40000  672.27us  247.98us  1.7566ms  cudaDeviceSynchronize
  1.61%  443.32ms     40000  11.082us  5.8340us  183.43us  cudaLaunch
  0.51%  141.10ms         1  141.10ms  141.10ms  141.10ms  cudaMalloc
  0.16%  43.648ms         1  43.648ms  43.648ms  43.648ms  cudaDeviceReset
  0.08%  22.182ms     80000     277ns       0ns  121.07us  cudaSetupArgument
  0.06%  15.437ms     40000     385ns       0ns  24.433us  cudaGetLastError
  0.05%  12.929ms     40000     323ns       0ns  57.253us  cudaConfigureCall
  0.00%  1.1932ms        91  13.112us       0ns  734.09us  cuDeviceGetAttribute
  0.00%  762.17us         1  762.17us  762.17us  762.17us  cudaFree
  0.00%  359.93us         1  359.93us  359.93us  359.93us  cuDeviceGetName
  0.00%  8.3880us         1  8.3880us  8.3880us  8.3880us  cuDeviceTotalMem
  0.00%  2.5520us         3     850ns     364ns  1.8230us  cuDeviceGetCount
  0.00%  1.8240us         3     608ns     365ns  1.0940us  cuDeviceGet

CUDA Samples\v8.0\1_Utilities\bandwidthTest result:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1050 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11038.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11469.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     95214.0

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
PatrykB
  • Run the `bandwidthTest` in the Samples/1_Utilities folder to get an estimate of the actual bandwidth of your card. The code is not so difficult to understand either, and it will give you some pointers. – zindarod Sep 05 '17 at 22:32
  • You might be hitting one of the caches, meaning you will perceive a higher bandwidth. But nvprof offers metrics that may give you a better measurement than anything you might try to calculate yourself. [This](https://stackoverflow.com/questions/37732735/nvprof-option-for-bandwidth/37740119#37740119) may be of interest. – Robert Crovella Sep 05 '17 at 22:36
  • Are you building a debug project or a release project? For a release project, your `get_mem_kernel` doesn't do anything that affects global state with the data that is being read, so the compiler is free to optimize away the actual loads. You can confirm this by looking at the kernel disassembly, or by asking the profiler for the actual achieved bandwidth. – Robert Crovella Sep 05 '17 at 23:45
  • @RobertCrovella it's a `release` build. I didn't get it at first, but after you pointed it out, it was so obvious. The compiler was optimising away the entire variable `val`. Thank you for your time. By the way, you mention "hitting the cache" – does that mean one can achieve an actual memory bandwidth higher than the one in the GPU tech specs? – PatrykB Sep 05 '17 at 23:57
  • I made that comment before inspecting your code carefully. You are not likely to see much cache benefit for a data set size of 100 MB. However, if you reduced the data set size to e.g. ~1 MB, you would likely see a substantial L2 cache benefit, because the set kernel would populate the cache for the get kernel. Yes, the L2 cache generally has higher bandwidth than global memory. – Robert Crovella Sep 06 '17 at 00:00
  • @RobertCrovella sorry for a stupid question (I don't know much about hardware-level stuff): since the cache offers an additional boost, is there a way to group memory reads in such a way as to take advantage of this fact? Even with `ptx`? Or can it not be done because it's a hardware-level limitation? – PatrykB Sep 06 '17 at 00:29
  • Please study caches. The usage of the L2 cache on the GPU is similar to the use of a CPU cache. It's not possible to cover it in the space of comments. – Robert Crovella Sep 06 '17 at 00:33

1 Answer


The compiler was optimising away the memory reads, as pointed out by Robert Crovella. Thank you for your help - I would never have guessed it.

Detailed:
My compiler was optimising away the `val` variable and, by extension, the memory reads.
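
A minimal sketch of one way to keep the reads from being eliminated (the out_data parameter is new here, not part of the original kernel, and would also need to be allocated with cudaMalloc and passed in): give the kernel a side effect that depends on the loaded value but never actually fires at run time, so the compiler cannot prove the load is dead.

__global__
void get_mem_kernel(const unsigned int size,
                    const data_type * const in_data,
                    data_type * const out_data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        data_type val = in_data[idx];

        // set_mem_kernel only stores non-negative values, so this branch is
        // never taken at run time, but the compiler cannot remove the load:
        if (val < static_cast <data_type> (0))
        {
            out_data[idx] = val;
        }
    }
}

With the loads no longer eliminated, the get kernel's time should land in the same ballpark as the set kernel's (roughly 100 GB/s on this card). Alternatively, the profiler can report achieved throughput directly, e.g. `nvprof --metrics gld_throughput,dram_read_throughput ./cuda_test.exe`, so there is no need to calculate it by hand.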

PatrykB