CUDA streams are hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.
Questions tagged [cuda-streams]
78 questions
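As a quick illustration of the tag's subject, here is a minimal sketch of issuing independent work to two streams. The kernel, sizes, and scaling factors are placeholders invented for the example; work within one stream runs in issue order, while work in different streams may overlap.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in different streams may execute concurrently;
    // kernels in the same stream execute in issue order.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n, 3.0f);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```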
10 votes · 2 answers
CUDA streams not overlapping
I have something very similar to this code:
int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);
cudaMalloc(&g_in, size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);
for (k =…
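For context, copy/compute overlap in a loop like the one above generally requires page-locked (pinned) host memory and cudaMemcpyAsync. A hedged sketch of the usual per-stream pipeline, with a placeholder kernel and sizes invented for the example:

```cuda
#include <cuda_runtime.h>

__global__ void twice(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int no_streams = 4, n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *g_in, *g_out;
    // Pinned host buffers are required for cudaMemcpyAsync to
    // overlap with kernels running in other streams.
    cudaMallocHost(&h_in, bytes * no_streams);
    cudaMallocHost(&h_out, bytes * no_streams);
    cudaMalloc(&g_in, bytes * no_streams);
    cudaMalloc(&g_out, bytes * no_streams);

    cudaStream_t stream[4];
    for (int k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

    // Each stream works on its own slice: copy in, compute, copy out.
    for (int k = 0; k < no_streams; k++) {
        int off = k * n;
        cudaMemcpyAsync(g_in + off, h_in + off, bytes,
                        cudaMemcpyHostToDevice, stream[k]);
        twice<<<(n + 255) / 256, 256, 0, stream[k]>>>(g_in + off,
                                                      g_out + off, n);
        cudaMemcpyAsync(h_out + off, g_out + off, bytes,
                        cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < no_streams; k++) cudaStreamDestroy(stream[k]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(g_in); cudaFree(g_out);
    return 0;
}
```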

pmcr
9 votes · 2 answers
Multiple host threads launching individual CUDA kernels
For my CUDA development, I am using a machine with 16 cores, and 1 GTX 580 GPU with 16 SMs. For the work that I am doing, I plan to launch 16 host threads (1 on each core), and 1 kernel launch per thread, each with 1 block and 1024 threads. My goal…

gmemon
8 votes · 1 answer
How to reduce CUDA synchronize latency / delay
This question is related to using CUDA streams to run many kernels.
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check if streams are empty.
I noticed…
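These calls differ in scope. A hedged sketch of the two most common patterns, a non-blocking poll versus a blocking wait, with a trivial stand-in kernel:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busy() { /* stand-in for real work */ }

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    busy<<<1, 1, 0, s>>>();

    // Non-blocking: cudaStreamQuery returns cudaSuccess when the
    // stream is empty and cudaErrorNotReady while work is pending.
    while (cudaStreamQuery(s) == cudaErrorNotReady) {
        // The host could do useful work here instead of spinning.
    }

    // Blocking: waits for this one stream only;
    // cudaDeviceSynchronize() would wait for all streams instead.
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    printf("done\n");
    return 0;
}
```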

shadow
7 votes · 2 answers
CUDA Dynamic Parallelism, bad performance
We are having performance issues when using CUDA Dynamic Parallelism. At the moment, CDP is performing at least 3x slower than a traditional approach.
We made the simplest reproducible code to show this issue, which is to increment the value of…

Cristobal Navarro
5 votes · 1 answer
What is the relationship between NVIDIA MPS (Multi-Process Server) and CUDA Streams?
Glancing at the official NVIDIA Multi-Process Server docs, it is unclear to me how MPS interacts with CUDA streams.
Here's an example:
App 0: issues kernels to logical stream 0;
App 1: issues kernels to (its own) logical stream 0.
In this case,
1)…

Covi
5 votes · 2 answers
Are CUDA streams device-associated? And how do I get a stream's device?
I have a CUDA stream which someone handed to me - a cudaStream_t value. The CUDA Runtime API does not seem to indicate how I can obtain the index of the device with which this stream is associated.
Now, I know that cudaStream_t is just a pointer to…

einpoklum
5 votes · 2 answers
Can the NVIDIA K20c use the old stream management behavior?
Since the K20, different streams have become fully concurrent (they used to be concurrent only at the edges).
However, my program needs the old behavior, or else I have to do a lot of synchronization to solve the dependency problem.
Is it possible to switch stream management to the…

worldterminator
4 votes · 1 answer
What is the difference between Nvidia Hyper Q and Nvidia Streams?
I always thought that Hyper-Q technology is nothing but streams in the GPU. Later I found that I was wrong (am I?). So I was doing some reading about Hyper-Q and got even more confused.
I was going through one article and it had these two statements:
A.…

sandeep.ganage
4 votes · 1 answer
How to make multiple cuBLAS calls (e.g. cublasDgemm) really execute concurrently in multiple cudaStreams
I want to make two cuBLAS calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams.
As we know, the cuBLAS API is asynchronous; level-3 routines like cublasDgemm don't block the host. That means the following code (in the default cudaStream)…
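For reference, the usual way to route cuBLAS work into distinct streams is cublasSetStream before each call. A hedged sketch with placeholder matrix sizes; even so, the GEMMs only actually overlap if one of them does not already fill the GPU:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 512;
    double *A, *B, *C1, *C2;
    cudaMalloc(&A,  n * n * sizeof(double));
    cudaMalloc(&B,  n * n * sizeof(double));
    cudaMalloc(&C1, n * n * sizeof(double));
    cudaMalloc(&C2, n * n * sizeof(double));

    cublasHandle_t h;
    cublasCreate(&h);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    const double alpha = 1.0, beta = 0.0;

    // Bind the handle to a stream before each call; the two GEMMs
    // are then enqueued in different streams and may run concurrently.
    cublasSetStream(h, s1);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C1, n);
    cublasSetStream(h, s2);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C2, n);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(A); cudaFree(B); cudaFree(C1); cudaFree(C2);
    return 0;
}
```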

Yangsong Zhang
4 votes · 1 answer
Is the GTX 680 Capable of Concurrent Data Transfer?
I expected that the GTX 680 (which is one of the latest versions of GPUs) is capable of concurrent data transfer (concurrent data transfer in both directions). But when I run the CUDA SDK "Device Query", the test result for the term "Concurrent copy and…

Blue_Black
3 votes · 1 answer
Can we overlap compute operation with memory operation without pinned memory on CPU?
I'm trying to overlap computation and memory operations with the HuggingFace SwitchTransformer.
Here's a detailed explanation.
The memory operation is data movement from CPU to GPU, and its size is 4 MB per block.
The number of blocks is variable…

Ryan
3 votes · 1 answer
What's the capacity of a CUDA stream (=queue)?
A CUDA stream is a queue of tasks: memory copies, event firing, event waits, kernel launches, callbacks...
But these queues don't have infinite capacity. In fact, empirically, I find that this limit is not super-high, e.g. in the thousands, not…
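One way to probe this empirically is to enqueue many tiny operations and time each launch: when the stream's queue fills, the nominally asynchronous launch starts to block. A hedged sketch of such a probe; the threshold it observes, and the 1 ms cutoff used to detect blocking, are implementation- and driver-dependent choices made for the example:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void nop() {}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int i = 0; i < 100000; i++) {
        auto t0 = std::chrono::steady_clock::now();
        nop<<<1, 1, 0, s>>>();   // nominally asynchronous
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        // A sudden jump in launch latency suggests the queue is full
        // and the launch call is now blocking the host.
        if (us > 1000) {
            printf("launch %d blocked for %lld us\n", i, (long long)us);
            break;
        }
    }
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    return 0;
}
```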

einpoklum
3 votes · 0 answers
Execute another model in parallel to a model's forward pass with PyTorch
I am trying to make some changes to the ResNet-18 model in PyTorch to invoke the execution of another auxiliary trained model which takes in the ResNet intermediate layer output at the end of each ResNet block as an input and makes some auxiliary…

jallikattu
3 votes · 1 answer
Concurrency of one large kernel with many small kernels and memcopies (CUDA)
I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding. That means that while data is exchanged, the GPU computes the part of the mesh that is not involved in communication, and computes the rest of…

Lenz
3 votes · 5 answers
Get rid of busy waiting during asynchronous CUDA stream execution
I'm looking for a way to get rid of busy waiting in the host thread in the following code (do not copy this code, it only shows the idea of my problem; it has many basic bugs):
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
…
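One commonly suggested alternative to polling is to have the runtime call back into the host when a stream drains, e.g. via cudaLaunchHostFunc (CUDA 10+; cudaStreamAddCallback in older toolkits). A hedged sketch with a stand-in kernel:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work() { /* stand-in for real work */ }

// Runs on a runtime-managed thread once all prior work
// in the stream has completed.
void CUDART_CB on_done(void *userData) {
    printf("stream %d finished\n", *(int *)userData);
}

int main() {
    const int S_N = 4;
    cudaStream_t streams[S_N];
    int ids[S_N];
    for (int i = 0; i < S_N; i++) {
        cudaStreamCreate(&streams[i]);
        ids[i] = i;
        work<<<1, 1, 0, streams[i]>>>();
        // No busy waiting: the callback fires when the stream drains.
        cudaLaunchHostFunc(streams[i], on_done, &ids[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < S_N; i++) cudaStreamDestroy(streams[i]);
    return 0;
}
```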

kokosing