Briefly: I don't know of a method to access these IDs directly, but you can give streams explicit names for profiling purposes.
cudaStream_t is an opaque "resource handle" type. A resource handle is something like a pointer, so it stands to reason that the stream ID is not contained in the pointer (handle) itself but in whatever it refers to. Since the type is opaque (CUDA provides no definition of what it points to) and, as you point out, there is no direct API for this, I don't think you'll find a way to extract the stream ID from a cudaStream_t at runtime.
For the assertions that cudaStream_t is a resource handle and that it is opaque, see the CUDA header file driver_types.h.
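For reference, the declaration in driver_types.h amounts to something like the sketch below (paraphrased; the exact text and decorating macros differ between CUDA versions). The handle is only a pointer to a struct whose definition user code never sees, so there is nothing inside it you can portably dereference or decode:

struct CUstream_st;                        // incomplete type: no definition is exposed to user code
typedef struct CUstream_st *cudaStream_t;  // the stream "handle" is just a pointer to that opaque struct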
However, the NVIDIA Tools Extension (NVTX) API gives you the ability to "name" a particular stream (or other resources). This lets you associate a particular stream in your source code with a particular name in the profiler output.
Here's a trivial worked example:
$ cat t138.cu
#include <stdio.h>
#include <nvToolsExtCudaRt.h>            // NVTX CUDA runtime resource-naming API
const long long tdel = 1000000000LL;     // spin duration in GPU clock ticks
__global__ void tkernel(){
  // busy-wait so each kernel occupies its stream long enough to show up clearly in the profiler
  long long st = clock64();              // clock64() returns long long int
  while (clock64() < st+tdel);
}
int main(){
  cudaStream_t s1, s2, s3, s4;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStreamCreate(&s3);
  cudaStreamCreate(&s4);
#ifdef USE_S_NAMES
  // attach human-readable names to the streams; these appear in the profiler's Stream column
  nvtxNameCudaStreamA(s1, "stream 1");
  nvtxNameCudaStreamA(s2, "stream 2");
  nvtxNameCudaStreamA(s3, "stream 3");
  nvtxNameCudaStreamA(s4, "stream 4");
#endif
  // launch one kernel in each stream
  tkernel<<<1,1,0,s1>>>();
  tkernel<<<1,1,0,s2>>>();
  tkernel<<<1,1,0,s3>>>();
  tkernel<<<1,1,0,s4>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt
$ nvprof --print-gpu-trace ./t138
==28720== NVPROF is profiling process 28720, command: ./t138
==28720== Profiling application: ./t138
==28720== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
464.80ms 622.06ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 13 tkernel(void) [393]
464.81ms 621.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 14 tkernel(void) [395]
464.82ms 623.30ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 15 tkernel(void) [397]
464.82ms 622.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 16 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt -DUSE_S_NAMES
$ nvprof --print-gpu-trace ./t138
==28799== NVPROF is profiling process 28799, command: ./t138
==28799== Profiling application: ./t138
==28799== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
457.98ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 1 tkernel(void) [393]
457.99ms 544.31ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 2 tkernel(void) [395]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 3 tkernel(void) [397]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 4 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$