Questions tagged [cuda-streams]

CUDA streams are hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.

78 questions
10 votes · 2 answers

CUDA streams not overlapping

I have something very similar to the code: int k, no_streams = 4; cudaStream_t stream[no_streams]; for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]); cudaMalloc(&g_in, size1*no_streams); cudaMalloc(&g_out, size2*no_streams); for (k =…
pmcr • 135
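
Questions like this one usually come down to the same prerequisites for copy/compute overlap: pinned host buffers, `cudaMemcpyAsync`, and one chunk of work per stream. A minimal sketch (not the asker's actual code; sizes and kernel are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int no_streams = 4, chunk = 1 << 20;
    float *h_in, *h_out, *g_in, *g_out;
    // Pageable memory (plain malloc) forces cudaMemcpyAsync to behave
    // synchronously; pinned memory is what makes the copies truly async.
    cudaMallocHost(&h_in,  no_streams * chunk * sizeof(float));
    cudaMallocHost(&h_out, no_streams * chunk * sizeof(float));
    cudaMalloc(&g_in,  no_streams * chunk * sizeof(float));
    cudaMalloc(&g_out, no_streams * chunk * sizeof(float));

    cudaStream_t stream[no_streams];
    for (int k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

    // Each stream gets its own copy-in / kernel / copy-out pipeline.
    for (int k = 0; k < no_streams; k++) {
        size_t off = (size_t)k * chunk;
        cudaMemcpyAsync(g_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[k]);
        process<<<(chunk + 255) / 256, 256, 0, stream[k]>>>(
            g_in + off, g_out + off, chunk);
        cudaMemcpyAsync(h_out + off, g_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Whether overlap then actually shows up in a profiler also depends on the device having a copy engine (`deviceOverlap`) and on the chunks being large enough to matter.
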
9 votes · 2 answers

Multiple host threads launching individual CUDA kernels

For my CUDA development, I am using a machine with 16 cores, and 1 GTX 580 GPU with 16 SMs. For the work that I am doing, I plan to launch 16 host threads (1 on each core), and 1 kernel launch per thread, each with 1 block and 1024 threads. My goal…
gmemon • 2,573
8 votes · 1 answer

How to reduce CUDA synchronize latency / delay

This question is related to using CUDA streams to run many kernels. In CUDA there are several synchronization commands: cudaStreamSynchronize, cudaDeviceSynchronize, cudaThreadSynchronize, and also cudaStreamQuery to check whether streams are empty. I noticed…
shadow • 141
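
One commonly cited knob behind this latency trade-off is the device scheduling flag, which controls whether the runtime spins, yields, or sleeps while waiting. A minimal sketch:

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be set before the context is created, i.e. before any other
    // runtime call touches the device.
    // cudaDeviceScheduleSpin:         lowest latency, burns a CPU core
    // cudaDeviceScheduleYield:        polls but yields the CPU in between
    // cudaDeviceScheduleBlockingSync: sleeps on an OS primitive; highest
    //                                 latency, near-zero CPU use
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // ... launch work into s ...
    cudaStreamSynchronize(s);  // now blocks instead of spinning
    cudaStreamDestroy(s);
    return 0;
}
```
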
7 votes · 2 answers

CUDA Dynamic Parallelism, bad performance

We are having performance issues when using CUDA Dynamic Parallelism. At the moment, CDP is performing at least 3x slower than a traditional approach. We made the simplest reproducible code to show this issue, which is to increment the value of…
5 votes · 1 answer

What is the relationship between NVIDIA MPS (Multi-Process Server) and CUDA Streams?

Glancing at the official NVIDIA Multi-Process Server docs, it is unclear to me how MPS interacts with CUDA streams. Here's an example: App 0 issues kernels to logical stream 0; App 1 issues kernels to (its own) logical stream 0. In this case, 1)…
Covi • 1,331
5 votes · 2 answers

Are CUDA streams device-associated? And how do I get a stream's device?

I have a CUDA stream which someone handed to me - a cudaStream_t value. The CUDA Runtime API does not seem to indicate how I can obtain the index of the device with which this stream is associated. Now, I know that cudaStream_t is just a pointer to…
einpoklum • 118,144
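
One approach sometimes suggested drops down to the driver API (assuming `cuStreamGetCtx` is available, i.e. CUDA 9.2 or later): fetch the stream's context, make it current, and ask that context for its device. A hedged sketch, with `device_of_stream` being an illustrative helper name:

```cuda
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

CUdevice device_of_stream(cudaStream_t stream) {
    CUcontext ctx;
    cuStreamGetCtx((CUstream)stream, &ctx);  // runtime streams are CUstreams
    cuCtxPushCurrent(ctx);                   // temporarily activate it
    CUdevice dev;
    cuCtxGetDevice(&dev);                    // query the active context
    cuCtxPopCurrent(nullptr);                // restore the previous context
    return dev;
}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);  // also initializes the driver-API context
    printf("stream's device ordinal: %d\n", (int)device_of_stream(s));
    cudaStreamDestroy(s);
    return 0;
}
```
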
5 votes · 2 answers

Can I make an NVIDIA K20c use the old stream-management behavior?

Since the K20, different streams have become fully concurrent (they used to be concurrent only at the edges). However, my program needs the old behavior, or else I have to add a lot of synchronization to solve the dependency problem. Is it possible to switch stream management to the…
worldterminator • 2,968
4 votes · 1 answer

What is the difference between Nvidia Hyper Q and Nvidia Streams?

I always thought that Hyper-Q technology is nothing but streams in the GPU. Later I found I was wrong (am I?). So I was doing some reading about Hyper-Q and got more confused. I was going through one article and it had these two statements: A.…
sandeep.ganage • 1,409
4 votes · 1 answer

How to make multiple cuBLAS calls (e.g. cublasDgemm) really execute concurrently in multiple CUDA streams

I want to make two cuBLAS calls (e.g. cublasDgemm) really execute concurrently in two CUDA streams. As we know, the cuBLAS API is asynchronous; level-3 routines like cublasDgemm don't block the host. That means the following code (in the default stream)…
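
The standard mechanism for this is `cublasSetStream`, which binds subsequent cuBLAS calls on a handle to a given stream. A sketch (matrix names and sizes are illustrative; whether the GEMMs actually overlap also depends on whether one GEMM already saturates the GPU):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

void two_gemms(int n, const double *A, const double *B,
               double *C0, double *C1) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const double alpha = 1.0, beta = 0.0;
    // Each cublasDgemm inherits the stream set on the handle at call time.
    cublasSetStream(handle, s0);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C0, n);
    cublasSetStream(handle, s1);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C1, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cublasDestroy(handle);
}
```
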
4 votes · 1 answer

Is the GTX 680 capable of concurrent data transfer?

I expected that the GTX 680 (one of the latest GPUs) would be capable of concurrent data transfer (in both directions at once). But when I run the CUDA SDK "Device Query" sample, the result for "Concurrent copy and…
Blue_Black • 307
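
The property to check here is `asyncEngineCount`: 1 means copies can overlap kernels but only one direction at a time; 2 means simultaneous host-to-device and device-to-host copies are possible. A small query sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceOverlap is the legacy yes/no flag; asyncEngineCount is the
    // number of DMA copy engines, which is what bidirectional overlap needs.
    printf("deviceOverlap:    %d\n", prop.deviceOverlap);
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
    return 0;
}
```
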
3 votes · 1 answer

Can we overlap compute operation with memory operation without pinned memory on CPU?

I'm trying to overlap computation and memory operations with the HuggingFace SwitchTransformer. Here's a detailed explanation. The memory operation is data movement from CPU to GPU, and its size is 4 MB per block. The number of blocks is variable…
Ryan • 73
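
In general, `cudaMemcpyAsync` from plain pageable memory cannot overlap with kernel execution. One workaround when reallocating with `cudaMallocHost` is impractical is to pin an existing allocation in place with `cudaHostRegister`. A sketch (the 4 MB size mirrors the question; everything else is illustrative):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4 << 20;       // 4 MB block, as in the question
    float *h = (float *)malloc(bytes);  // existing pageable buffer
    // Pin the buffer without reallocating it.
    cudaHostRegister(h, bytes, cudaHostRegisterDefault);

    float *d;
    cudaMalloc(&d, bytes);
    cudaStream_t s;
    cudaStreamCreate(&s);
    // With the buffer pinned, this copy is truly asynchronous and can
    // overlap kernels running in other streams.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    cudaHostUnregister(h);
    free(h);
    cudaFree(d);
    cudaStreamDestroy(s);
    return 0;
}
```

Note that `cudaHostRegister` itself is expensive, so it pays off only when the buffer is reused across many transfers.
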
3 votes · 1 answer

What's the capacity of a CUDA stream (=queue)?

A CUDA stream is a queue of tasks: memory copies, event firing, event waits, kernel launches, callbacks... But - these queues don't have infinite capacity. In fact, empirically, I find that this limit is not super-high, e.g. in the thousands, not…
einpoklum • 118,144
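
The limit is not documented as an API constant, but it can be probed empirically: keep enqueueing no-op kernels into one stream and watch for the launch call itself to start blocking, which indicates the pending-work queue has filled. A rough sketch of such a probe (the 1 ms threshold is an arbitrary heuristic):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int i = 0; i < 100000; i++) {
        auto t0 = std::chrono::steady_clock::now();
        noop<<<1, 1, 0, s>>>();  // normally returns in microseconds
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        if (us > 1000) {  // launch suddenly took >1 ms: enqueue blocked
            printf("enqueue started blocking at launch #%d\n", i);
            break;
        }
    }
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    return 0;
}
```
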
3 votes · 0 answers

Execute another model in parallel to a model's forward pass with PyTorch

I am trying to make some changes to the ResNet-18 model in PyTorch to invoke the execution of another auxiliary trained model which takes in the ResNet intermediate layer output at the end of each ResNet block as an input and makes some auxiliary…
3 votes · 1 answer

Concurrency of one large kernel with many small kernels and memcopys (CUDA)

I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding. That means, while data is exchanged, the GPU computes the part of the mesh that is not involved in communication, and computes the rest of…
Lenz • 81
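
For this communication-hiding pattern, stream priorities are one tool for keeping the small boundary kernels from being starved behind the long-running bulk kernel. A sketch (kernel names and launch shapes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void bulk_compute() { /* long-running interior work */ }
__global__ void halo_compute() { /* small boundary work */ }

int main() {
    // Numerically lower value = higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t bulk, halo;
    cudaStreamCreateWithPriority(&bulk, cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&halo, cudaStreamNonBlocking, greatest);

    bulk_compute<<<1024, 256, 0, bulk>>>();
    // High-priority blocks are scheduled ahead of queued bulk blocks.
    halo_compute<<<8, 256, 0, halo>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(bulk);
    cudaStreamDestroy(halo);
    return 0;
}
```
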
3 votes · 5 answers

Get rid of busy waiting during asynchronous cuda stream executions

I am looking for a way to get rid of busy waiting in the host thread in the following code (do not copy that code; it only shows the idea of my problem and has many basic bugs): cudaStream_t streams[S_N]; for (int i = 0; i < S_N; i++) { …
kokosing • 5,251
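
Instead of polling `cudaStreamQuery` in a loop, the stream can notify the host: `cudaLaunchHostFunc` (CUDA 10+) enqueues a host function that runs once all prior work in the stream has finished. A sketch pairing it with a condition variable so the host thread sleeps:

```cuda
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <cuda_runtime.h>

std::mutex m;
std::condition_variable cv;
bool done = false;

// Runs on a CUDA-internal thread after the stream's prior work completes.
// Note: no CUDA API calls are allowed inside a host function.
void CUDART_CB on_stream_done(void *) {
    std::lock_guard<std::mutex> lk(m);
    done = true;
    cv.notify_one();
}

__global__ void work() {}

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    work<<<1, 1, 0, s>>>();
    cudaLaunchHostFunc(s, on_stream_done, nullptr);

    // The host thread now sleeps instead of busy-waiting.
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return done; });
    printf("stream finished\n");
    cudaStreamDestroy(s);
    return 0;
}
```

The older `cudaStreamAddCallback` works similarly on pre-CUDA-10 toolkits.
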