The kernel can read the special register %gridid, which is unique per launch. If performance is not critical, a simple kernel prolog can have one thread from each kernel launch output the mapping from %gridid to the __global__ function name, using __func__ and %gridid. Alternatively, the CUPTI SDK Activity API can be used to collect this information: the CUpti_ActivityKernel2 record contains per-launch metadata, including the gridId and the kernel (CUfunction) name.
Here is an example that reads %gridid:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdint.h>

// Read the PTX special register %gridid (a 64-bit launch identifier).
static __device__ __inline__ uint64_t __gridid()
{
    uint64_t gridid;
    asm volatile("mov.u64 %0, %%gridid;" : "=l"(gridid));
    return gridid;
}

// __func__ names the enclosing function, so this prints "devPrintName"
// together with the grid ID of the kernel launch that called it.
__device__ void devPrintName()
{
    printf("%llu %s\n", (unsigned long long)__gridid(), __func__);
}

__global__ void globPrintName()
{
    printf("%llu %s\n", (unsigned long long)__gridid(), __func__);
    devPrintName();
}

int main()
{
    for (int i = 0; i < 4; ++i)
    {
        globPrintName<<<1, 1, 0>>>();
        // Destroying the context also flushes the device-side printf buffer.
        cudaDeviceReset();
    }
    return 0;
}
This sample outputs:
1 globPrintName
1 devPrintName
2 globPrintName
2 devPrintName
3 globPrintName
3 devPrintName
4 globPrintName
4 devPrintName
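The sample above launches a single thread, so every thread prints. For the prolog approach described at the start, where only one thread per launch reports the mapping, a minimal sketch might look like the following. The names readGridId, isFirstThreadOfGrid, LOG_GRID_ID and the scale kernel are hypothetical and only for illustration. The prolog is written as a macro rather than a __device__ function because __func__ always names the function it appears in; as devPrintName shows above, a helper function would report its own name instead of the kernel's.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Read the PTX special register %gridid (64-bit launch identifier).
static __device__ __inline__ uint64_t readGridId()
{
    uint64_t id;
    asm volatile("mov.u64 %0, %%gridid;" : "=l"(id));
    return id;
}

// True only for thread (0,0,0) of block (0,0,0).
// Marked __host__ __device__ so it can also be checked on the host.
__host__ __device__ inline bool isFirstThreadOfGrid(uint3 t, uint3 b)
{
    return (t.x | t.y | t.z | b.x | b.y | b.z) == 0u;
}

// Hypothetical prolog macro: exactly one thread per launch prints
// "<gridid> <kernel name>". A macro is used so that __func__ expands
// inside the kernel and yields the kernel's name.
#define LOG_GRID_ID()                                                        \
    do {                                                                     \
        if (isFirstThreadOfGrid(threadIdx, blockIdx))                        \
            printf("%llu %s\n", (unsigned long long)readGridId(), __func__); \
    } while (0)

// Example kernel with the prolog as its first statement.
__global__ void scale(float *x, float s, int n)
{
    LOG_GRID_ID();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main()
{
    const int n = 1 << 10;
    float *x = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Three multi-block launches; each should print exactly one line.
    for (int i = 0; i < 3; ++i)
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);

    // Synchronizing flushes the device-side printf buffer.
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

Each launch of scale uses four blocks of 256 threads, yet only one thread prints, so the output should be one "<gridid> scale" line per launch. The actual %gridid values depend on the driver and GPU, so they are not listed here.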