I have coded two simple kernels. Each one adds two vectors of type int (32-bit) or long int (64-bit). It turns out that on my GPU (Tesla K80), which happens to be a fairly new and capable one, the cores are just 32-bit.
With the same number of elements, the execution time roughly doubles when going from the 32-bit to the 64-bit element type.
The kernels are as follows:
__global__ void add_32(int * c, int * a, int * b)
{
    // One thread per element: global thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];   // 32-bit integer addition
}
typedef long int int64;   // 64 bits on LP64 platforms such as 64-bit Linux
__global__ void add_64(int64 * c, int64 * a, int64 * b)
{
    // One thread per element: global thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];   // 64-bit integer addition
}
When the vector size is 1 mega-element, add_32 takes about 102.911 µs, whereas add_64 takes 192.669 µs. (Execution times were reported by the NVIDIA profiler while running the release-mode binary.)
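For completeness, a minimal host driver for the 32-bit case would look roughly like the following. The element count matches the experiment, but the block size of 256 and the uninitialized inputs are illustrative assumptions rather than the exact code behind the numbers above:
#include <cuda_runtime.h>
int main()
{
    const int N = 1 << 20;                  // 1 mega-element
    int *a, *b, *c;
    cudaMalloc(&a, N * sizeof(int));        // device input/output buffers
    cudaMalloc(&b, N * sizeof(int));
    cudaMalloc(&c, N * sizeof(int));
    const int block = 256;                  // assumed launch configuration
    add_32<<<N / block, block>>>(c, a, b);  // kernel timed with the profiler
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
The 64-bit run is identical except that the buffers are of type int64 and the kernel is add_64.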
It seems that 64-bit instructions are just emulated via 32-bit instructions!
This could serve as a brute-force way to find out what kind of machine the GPU cores are, but it is definitely not an elegant one.
Update:
Thanks to @Paul A. Clayton's comment, it seems the comparison above is not a fair one, because the data size doubles in the 64-bit case. So we should not launch both kernels with the same number of elements; the correct principle is to launch the 64-bit version with half the number of elements.
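To put rough numbers on that (assuming the add kernels are memory-bound, which seems plausible): with N = 1M elements, add_32 reads and writes about 3 × 4 MB = 12 MB, whereas add_64 moves 3 × 8 MB = 24 MB. Twice the traffic alone accounts for roughly twice the time, regardless of what the cores natively support.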
To be even more sure, let's consider element-wise vector multiplication instead of addition. If the GPU emulates 64-bit instructions via 32-bit instructions, then it needs at least three 32-bit multiplication instructions to multiply two 64-bit numbers (using, for instance, the Karatsuba algorithm). This implies that if we launch the 64-bit vector multiplication kernel with N/2 elements, it should still take longer than the 32-bit kernel with N elements if 64-bit multiplication were merely emulated. A sketch of such a decomposition follows.
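For illustration only, here is a hand-written sketch (not the compiler's actual code generation, and using unsigned types for simplicity) of how the low 64 bits of a product can be assembled from three 32x32-bit partial products; the a_hi * b_hi term only affects bits 64 and above, so it is not needed:
typedef unsigned long long uint64;
__device__ uint64 mul64_lo_via_32(uint64 a, uint64 b)
{
    unsigned int a_lo = (unsigned int)a, a_hi = (unsigned int)(a >> 32);
    unsigned int b_lo = (unsigned int)b, b_hi = (unsigned int)(b >> 32);
    // Three 32x32 -> 64-bit partial products.
    uint64 lo   = (uint64)a_lo * b_lo;
    uint64 mid1 = (uint64)a_lo * b_hi;
    uint64 mid2 = (uint64)a_hi * b_lo;
    // The cross terms land 32 bits up; everything is taken modulo 2^64.
    return lo + ((mid1 + mid2) << 32);
}
With such an emulation, mul_64 on N/2 elements would still issue about 3 × N/2 = 1.5N 32-bit multiplies versus N for mul_32, so it should be measurably slower.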
Here are the kernels:
__global__ void mul_32(int * c, int * a, int * b)
{
    // One thread per element: global thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];   // 32-bit integer multiplication
}
typedef long int int64;   // 64 bits on LP64 platforms such as 64-bit Linux
__global__ void mul_64(int64 * c, int64 * a, int64 * b)
{
    // One thread per element: global thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];   // 64-bit integer multiplication
}
And here are the experiment details. Times are again from the NVIDIA profiler on the release-mode binary:
1- Kernel mul_32 with vector size N = 256 mega-elements takes 25.608 ms.
2- Kernel mul_64 with vector size N = 128 mega-elements takes 24.153 ms.
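Note that the two runs move the same total amount of data (each buffer is 1 GiB: 256M × 4 bytes versus 128M × 8 bytes), so memory traffic is matched and no longer favors either kernel.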
I am aware that both kernels produce incorrect results, but I think that has nothing to do with the way the computation is carried out.