
I am currently trying to benchmark the Jetson TX1 against the Jetson Nano. According to https://elinux.org/Jetson, they both use the Maxwell architecture, with 128 CUDA cores on the Nano and 256 on the TX1. This suggests that the Nano should normally achieve half the performance of the TX1.

To test this, I created a kernel that performs a single float multiplication per element, as follows:

__global__ void matrixMultiply(float* mat1, float* mat2, int nx, int ny)
{
    // global 2D thread coordinates
    unsigned int ix = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int iy = threadIdx.y + blockDim.y * blockIdx.y;

    // guard against out-of-range threads when nx/ny are not multiples of the block size
    if (ix < (unsigned int)nx && iy < (unsigned int)ny)
    {
        unsigned int idx = iy * nx + ix;        // linear index into the arrays
        mat1[idx] = mat1[idx] * mat2[idx];      // one FP32 multiply per element
    }
}

Test: multiplying two float arrays of size 15000*15000 took about 130 ms on the TX1 and about 150 ms on the Jetson Nano. The result seems weird; it's as if I am not using the second SM of the TX1, so I profiled sm_efficiency (TX1 and Nano = 100%) and achieved_occupancy (TX1 = 92%, Nano = 88%). Am I missing something here, or am I just not using the proper grid and block configuration?

P.S.: I tried every configuration I could think of, and the best configuration for both platforms was a block of (256, 1) with the grid calculated accordingly.
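For reference, here is a simplified sketch of the kind of launch/timing code I am using (error checking omitted; my full harness may differ in minor details):

    #include <cstdio>
    #include <cuda_runtime.h>

    // matrixMultiply kernel as defined above
    __global__ void matrixMultiply(float* mat1, float* mat2, int nx, int ny);

    int main()
    {
        const int nx = 15000, ny = 15000;
        const size_t bytes = (size_t)nx * ny * sizeof(float);

        float *mat1, *mat2;
        cudaMalloc(&mat1, bytes);
        cudaMalloc(&mat2, bytes);
        cudaMemset(mat1, 0, bytes);
        cudaMemset(mat2, 0, bytes);

        dim3 block(256, 1);
        dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

        // warm-up launch so the timed run does not include one-time overheads
        matrixMultiply<<<grid, block>>>(mat1, mat2, nx, ny);
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        matrixMultiply<<<grid, block>>>(mat1, mat2, nx, ny);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaFree(mat1);
        cudaFree(mat2);
        return 0;
    }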

roody
  • It's really impossible to give an accurate answer without a complete, reproducible example of your benchmarking code. – Michael Jul 09 '19 at 21:59

1 Answer


Am I missing something here

Yes, you are missing something here. Your code does not measure what you think it does:

they both have the maxwell architecture with 128 cuda cores for NANO and 256 for TX1. This means that normally Jetson NANO will achieve half the performance of the TX1.

That statement is approximately true if the limiting factor for your code is the compute performance related to the CUDA cores. However, for your code, it is not, and this is fairly straightforward to prove.

We will start with some specifications:

spec                 | TX1  | Nano | source
---------------------|------|------|-------
mem bandwidth (GB/s) | 25.6 | 25.6 | 1,2
(FP32) compute cores | 256  | 128  | 1,2
max core clock (MHz) | 998  | 921  | 1,2

sources: 1, 2

To compute the maximum theoretical FP32 compute throughput, the formula is:

# of SMs * # of FP32 units per SM * 2 * clock rate

For Jetson NANO:

128 * 2 * 921MHz = ~236GFlops/s

For Jetson TX1:

256 * 2 * 998MHz = ~511GFlops/s

(the 2 multiplier in the above formulas is due to the fact that the maximum throughput is for a code that does multiply-add operations, not just multiply)
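If you want to double-check these figures on the devices themselves, a minimal sketch along these lines will do it (it hard-codes 128 FP32 cores per SM, which is correct for the Maxwell SM used in both of these parts):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const int coresPerSM = 128;   // Maxwell SM (sm_5x) has 128 FP32 cores

        // clockRate is reported in kHz; a multiply-add counts as 2 flops
        double peakGflops = (double)prop.multiProcessorCount * coresPerSM * 2.0
                            * (prop.clockRate * 1000.0) / 1e9;

        printf("%s: %d SMs @ %d kHz -> ~%.0f GFlops/s peak FP32\n",
               prop.name, prop.multiProcessorCount, prop.clockRate, peakGflops);
        return 0;
    }

Note that clockRate reports the maximum clock; the clock a Jetson actually runs at depends on the selected power mode (nvpmodel / jetson_clocks settings).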

Now let's analyze the ratio of FP32 compute to memory utilization in your code (ignoring any integer arithmetic for index calculation):

    mat1[idx] = mat1[idx]*mat2[idx] ;

We see that for each FP32 multiply operation, we must read two quantities (8 bytes total) and write one quantity (4 bytes total). So 12 bytes read/write for each multiply operation.

Now let's suppose you could hit the peak throughput on TX1 of 511GFlops/s. Since that figure counts each multiply-add as two operations, it corresponds to ~256,000,000,000 multiply operations per second. If you could hit 256B multiply operations per second, each multiply would need 12 bytes of read/write activity, so the total bandwidth required would be:

256,000,000,000 multiply ops              12 bytes        3,072,000,000,000 bytes
----------------------------    *        -----------   =  -----------------------
            sec                          multiply op              sec

That means it would require ~3 Terabytes per second of memory bandwidth, for your code to be limited by the compute throughput of TX1. But TX1 only has 25.6 Gigabytes per second of memory bandwidth. So the memory bandwidth of TX1 will limit the throughput of your code. A similar calculation shows that memory bandwidth of NANO will also limit the throughput of your code, and therefore the predictor for performance ratio between the two for your code is the ratio of memory bandwidth:

25.6GB/s
--------     = 1
25.6GB/s

Therefore the fact that you observed almost the same performance between the two:

150
---          = 1.15
130

is a much more sensible outcome, for your code, than to expect the performance ratio to be 2:1.
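As a cross-check, you can back out the approximate achieved bandwidth from the numbers quoted in the question (225M elements, 12 bytes of traffic per element); a quick sketch of the arithmetic:

    #include <cstdio>

    int main()
    {
        const double elements = 15000.0 * 15000.0;  // elements per array
        const double bytesPerElement = 12.0;        // 2 x 4-byte reads + 1 x 4-byte write
        const double totalGB = elements * bytesPerElement / 1e9;   // ~2.7 GB

        const double tx1_s  = 0.130;   // measured kernel time on TX1
        const double nano_s = 0.150;   // measured kernel time on Nano

        printf("data moved       : %.2f GB\n",   totalGB);
        printf("TX1  achieved BW : %.1f GB/s\n", totalGB / tx1_s);    // ~20.8 GB/s
        printf("Nano achieved BW : %.1f GB/s\n", totalGB / nano_s);   // ~18.0 GB/s
        return 0;
    }

Both figures are a sizeable fraction of the 25.6GB/s theoretical peak, which is consistent with a memory-bandwidth-bound kernel.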

If you want to see a code that comes closer to the 2:1 ratio, you'll need a code that does a lot of compute operations while consuming (relatively speaking) almost no memory bandwidth. A possible real-world example of such a code might be a matrix-matrix multiply, and you can easily write a CUBLAS Sgemm code to test this (a sketch follows below). Note that a 2:1 ratio expectation isn't quite right here, because the core clocks are not the same. The expected ratio would be:

511
--- = ~2.17
236
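A minimal cuBLAS Sgemm timing sketch for such a test might look like this (the matrix size is arbitrary, error checking is omitted, and you'd compile with -lcublas):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 4096;                         // arbitrary square-matrix size
        const size_t bytes = (size_t)n * n * sizeof(float);

        float *A, *B, *C;
        cudaMalloc(&A, bytes);
        cudaMalloc(&B, bytes);
        cudaMalloc(&C, bytes);
        cudaMemset(A, 0, bytes);
        cudaMemset(B, 0, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;

        // warm-up call (cuBLAS initialization, heuristics, etc.)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        double gflops = 2.0 * n * n * n / (ms * 1e-3) / 1e9;   // ~2*n^3 flops per Sgemm
        printf("Sgemm: %.2f ms, ~%.0f GFlops/s\n", ms, gflops);

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

The ratio of the GFlops/s numbers reported on the two boards should land much closer to the ~2.17 figure above (assuming both boards are running at their maximum clocks).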
Robert Crovella