
Is there a Linux/Windows command that shows the type of the processor in an Nvidia GPU? I am not talking about the operating system or the CPU type; I am asking about the processors (cores) in the GPU itself. At the end of the day, they are processors. How can I tell whether they have 32-bit or 64-bit registers and 32-bit or 64-bit ALUs?

A related question: are 64-bit instructions, such as adding two `unsigned long int` numbers, emulated using 32-bit instructions by the compiler (or some other intermediate layer), or are they executed natively by the hardware?

This question is not quite the same as that one: I need a way to tell what type of machine the GPU itself is. Also, the answer to that question does not explain how 64-bit instructions are actually executed.

caesar
  • I'm not really familiar with Nvidia GPUs, but such information should be in the data sheets or other manuals. If no information is available publicly, then you will probably need an NDA with Nvidia to get it. So do you have access to the documentation of the GPU you are targeting? – fsasm Feb 13 '17 at 23:43
  • GPU specs do not show this information. I think there should be an API/command that can tell such information! – caesar Feb 14 '17 at 07:34
  • In general data sheets should show such information, because that is their purpose. If the vendor doesn't publish the information then you don't need it. The driver together with PTX hide all the details of the hardware to increase portability. If you really need this piece of information you should contact Nvidia. – fsasm Feb 14 '17 at 08:20
  • I will comment, rather than answer, because this seems like a "general compute hardware" question, thus off-topic. GPU registers are 32-bit registers. 64-bit operands are typically stored in register pairs (adjacent registers), in particular for `double` operands. 64-bit integer operations are emulated (with the exception of conversions between floating-point and 64-bit integer types), via inline code or called subroutines. 64-bit addressing is supported, using a register pair to hold the address. All this can easily be seen by inspecting the machine code (use `cuobjdump --dump-sass`). – njuffa Feb 15 '17 at 09:26
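
As njuffa's comment suggests, one direct way to check is to compile a small kernel for the target architecture and inspect the generated machine code. Below is a minimal probe assuming an sm_37 target (the Tesla K80 discussed in the answer); the file name and build flags are illustrative, not from the original post:

// check64.cu -- probe how a 64-bit integer add is compiled.
// Build and inspect (illustrative commands):
//   nvcc -arch=sm_37 -cubin check64.cu -o check64.cubin
//   cuobjdump --dump-sass check64.cubin
// If 64-bit adds are emulated, the SASS should show a pair of 32-bit adds
// with carry propagation (e.g. IADD followed by IADD.X on Kepler) rather
// than a single 64-bit add instruction.
__global__ void add64(unsigned long long *c,
                      const unsigned long long *a,
                      const unsigned long long *b)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];
}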

1 Answer


I have coded two simple kernels. Each one adds two vectors of `int` (32-bit) or `long int` (64-bit) elements. It turns out that on my GPU (a Tesla K80), which happens to be a fairly new and capable one, the cores are just 32-bit.

The execution time roughly doubles when the element size doubles from 32-bit to 64-bit.

The kernels are as follows:

__global__ void add_32(int * c, int * a, int * b)
{
    // One element per thread: 32-bit integer addition.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];
}

// long int is 64-bit here (LP64, 64-bit Linux); note that on Windows (LLP64)
// it is only 32-bit, so long long int would be needed instead.
typedef long int int64;

__global__ void add_64(int64 * c, int64 * a, int64 * b)
{
    // Same element-wise addition, but on 64-bit integers.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];
}

When the vector size is 1 Mega elements, add_32 takes about 102.911 microseconds, whereas add_64 takes about 192.669 microseconds. (Execution times were reported by the Nvidia profiler while running the release-mode binary.)

It seems that 64-bit instructions are just emulated via 32-bit instructions!

This could be a brute-force way to find out what kind of machines the GPU cores are, but it is definitely not an elegant one.

Update:

Thanks to @Paul A. Clayton's comment, it turns out the comparison above is not fair, because the data size doubles in the 64-bit case. So we should not launch both kernels with the same number of elements; the correct approach is to launch the 64-bit version with half the number of elements.

To be even more sure, let's consider element-wise vector multiplication instead of addition. If the GPU emulates 64-bit instructions via 32-bit instructions, then it needs at least three 32-bit multiplication instructions to multiply two 64-bit numbers, perhaps using something like the Karatsuba algorithm. This implies that if 64-bit multiplications were merely emulated, launching the 64-bit multiplication kernel with N/2 elements would take longer than launching the 32-bit kernel with N elements.
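
As a rough illustration of that argument (a hand-written sketch, not necessarily the sequence the compiler actually emits), the low 64 bits of a 64-bit product can be assembled from 32-bit multiplies like this:

// Illustrative only: one way to build a 64-bit (low-half) product out of
// 32-bit multiplies, which is roughly what emulation has to do on hardware
// with 32-bit multipliers.
__device__ unsigned long long mul64_via_32(unsigned long long a,
                                           unsigned long long b)
{
    unsigned int a_lo = (unsigned int)a, a_hi = (unsigned int)(a >> 32);
    unsigned int b_lo = (unsigned int)b, b_hi = (unsigned int)(b >> 32);

    // Full 32x32 -> 64 product of the low halves: two 32-bit multiplies
    // (the low part, plus the high part via __umulhi).
    unsigned long long lo = ((unsigned long long)__umulhi(a_lo, b_lo) << 32)
                          | (unsigned long long)(a_lo * b_lo);

    // Cross terms: only their low 32 bits fall below bit 64, so two more
    // 32-bit multiplies suffice.
    unsigned int cross = a_lo * b_hi + a_hi * b_lo;

    return lo + ((unsigned long long)cross << 32);
}

A schoolbook decomposition like this one spends four 32-bit multiply instructions; smarter schemes can lower that count, but a 64-bit multiply still costs several 32-bit ones.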

Here are the kernels:

__global__ void mul_32(int * c, int * a, int * b)
{
    // One element per thread: 32-bit integer multiplication.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];
}

typedef long int int64;   // 64-bit on a 64-bit Linux host, as above

__global__ void mul_64(int64 * c, int64 * a, int64 * b)
{
    // Same element-wise multiplication, but on 64-bit integers.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];
}

And here are the experiment details (times reported by the Nvidia profiler on the release-mode binary):

1. Kernel mul_32 with vector size N = 256 Mega elements takes 25.608 milliseconds.
2. Kernel mul_64 with vector size N = 128 Mega elements takes 24.153 milliseconds.
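
For completeness, here is a rough host-side sketch of how these two launches could be set up so that both kernels touch exactly the same number of bytes. The block size, the missing error checking, and the uninitialized inputs are illustrative shortcuts; the times above came from the profiler, not from this snippet:

#include <cuda_runtime.h>

// mul_32, mul_64 and the int64 typedef are the ones defined above.
int main()
{
    const size_t N     = 256u << 20;        // 256 Mega 32-bit elements
    const size_t bytes = N * sizeof(int);   // same byte count for both runs (1 GiB per vector)

    int   *a32, *b32, *c32;
    int64 *a64, *b64, *c64;                 // N/2 elements of 8 bytes each
    cudaMalloc(&a32, bytes); cudaMalloc(&b32, bytes); cudaMalloc(&c32, bytes);
    cudaMalloc(&a64, bytes); cudaMalloc(&b64, bytes); cudaMalloc(&c64, bytes);

    const int block = 256;
    mul_32<<<N / block, block>>>(c32, a32, b32);        // N elements
    mul_64<<<(N / 2) / block, block>>>(c64, a64, b64);  // N/2 elements
    cudaDeviceSynchronize();                            // profile these two launches

    cudaFree(a32); cudaFree(b32); cudaFree(c32);
    cudaFree(a64); cudaFree(b64); cudaFree(c64);
    return 0;
}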

I am aware that both kernels produce incorrect results, but I think that has nothing to do with how the computation is performed.

caesar
  • Measuring the element *throughput* of a vector add as a function of operand size does not determine whether operations on a given element size are supported natively. With SIMD operations, the throughput would naturally halve when the operand size is doubled; e.g., a 512-bit SIMD unit would perform 16 32-bit operations but only 8 64-bit operations per instruction. In addition, with a simple streaming operation, memory bandwidth rather than computational throughput would be measured. – Paul A. Clayton Feb 14 '17 at 14:26
  • I think you are right, @PaulA.Clayton. I have updated my solution with a different method that takes the data size into account. – caesar Feb 15 '17 at 08:20