Questions tagged [dynamic-parallelism]

dynamic parallelism refers to the CUDA capability to launch a device kernel from within another device kernel

This tag should be used for questions pertaining to CUDA dynamic parallelism, i.e. the ability of CUDA devices of compute capability 3.5 or higher to launch a device kernel from within a device kernel. Using this functionality also requires specific CUDA compilation switches, such as -rdc=true to enable relocatable device code and -lcudadevrt to link in the device runtime library.
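For reference, a minimal sketch of what the capability looks like; the file name and launch dimensions are arbitrary, and the build line in the comment uses the switches mentioned above:

    // Minimal dynamic-parallelism sketch: a kernel launched from a kernel.
    // Build (cc >= 3.5): nvcc -arch=sm_35 -rdc=true dp.cu -o dp -lcudadevrt
    #include <cstdio>

    __global__ void child(int parentBlock) {
        printf("child thread %d of parent block %d\n", threadIdx.x, parentBlock);
    }

    __global__ void parent() {
        // One thread per parent block launches a small child grid.
        if (threadIdx.x == 0)
            child<<<1, 4>>>(blockIdx.x);
    }

    int main() {
        parent<<<2, 32>>>();
        cudaDeviceSynchronize();   // host waits for parent and all children
        return 0;
    }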

50 questions
9
votes
1 answer

Compiling .cu files with Dynamic Parallelism (CUDA)

I switched to a new GeForce GTX 980 GPU with compute capability 5.2, so it must support dynamic parallelism. However, I was not able to compile even a simple example (from the programming guide). I will not provide it here (not necessary; there is just a global kernel…
Mikhail Genkin
  • 3,247
  • 4
  • 27
  • 47
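Failures like this often come down to the build line itself: dynamic parallelism needs a real architecture of at least sm_35, relocatable device code, and the device runtime library. A sketch for a cc 5.2 card (the file name is hypothetical):

    nvcc -arch=sm_52 -rdc=true simple_dp.cu -o simple_dp -lcudadevrt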
7
votes
0 answers

AleaGPU Dynamic Parallelism in F#? How?

This might be a simple question, but I have not been able to find any references to this topic: how do I launch a kernel from within another kernel? The only relevant example I came across is the post (Does Alea GPU support dynamic parallelism?),…
7
votes
2 answers

CUDA Dynamic Parallelism, bad performance

We are having performance issues when using CUDA dynamic parallelism. At the moment, CDP is performing at least 3x slower than a traditional approach. We made the simplest reproducible code to show this issue, which is to increment the value of…
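A hedged sketch of the kind of pattern that typically produces such slowdowns, with names and sizes that are illustrative rather than taken from the question: one tiny child grid per parent thread, so fixed launch overhead dwarfs the single increment each child performs:

    __global__ void incrementChild(int *data, int idx) {
        data[idx] += 1;                        // almost no work per child grid
    }

    __global__ void parent(int *data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        incrementChild<<<1, 1>>>(data, idx);   // one launch per thread: slow
    }

    // The traditional approach does the same work in a single grid:
    __global__ void incrementAll(int *data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += 1;
    }

    int main() {
        int *d;
        cudaMalloc(&d, 1024 * sizeof(int));
        cudaMemset(d, 0, 1024 * sizeof(int));
        parent<<<1024 / 256, 256>>>(d);        // CDP version, 1024 child launches
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }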
7
votes
1 answer

CUDA device runtime API cudaMemsetAsync doesn't work

I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"), but no matter what value I use, it always sets memory to 0. Here is my test code: #include "cuda_runtime.h" #include "device_launch_parameters.h" #include…
Xiang Zhang
  • 2,831
  • 20
  • 40
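For context, a minimal sketch of a device-side cudaMemsetAsync call, assuming a toolkit whose device runtime exposes it to kernels. Note that, like any memset, it fills bytes, so a value of 0x01 turns each int into 0x01010101 rather than 1:

    #include <cstdio>

    __global__ void parent(int *data, size_t n) {
        if (threadIdx.x == 0 && blockIdx.x == 0)
            // Device-side memset: every BYTE becomes (unsigned char)0x01.
            cudaMemsetAsync(data, 0x01, n * sizeof(int), 0);
    }

    int main() {
        const size_t n = 8;
        int *d;
        cudaMalloc(&d, n * sizeof(int));
        parent<<<1, 32>>>(d, n);
        cudaDeviceSynchronize();
        int h[n];
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("%#x\n", h[0]);   // expect 0x1010101
        cudaFree(d);
        return 0;
    }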
3
votes
1 answer

Dynamic Parallelism on GTX 980 Ti: Unknown Error

I am attempting dynamic parallelism on a GTX 980 Ti card. All attempts at running code return "unknown error". Simple code is shown below with compilation options. I can execute kernels at depth=0 with no issues. The first time a child is called,…
AshleyG
  • 31
  • 2
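When chasing an "unknown error" from nested launches, checking the status inside the parent kernel usually narrows things down; a minimal sketch (the device runtime does expose cudaGetLastError and cudaGetErrorString to kernels):

    #include <cstdio>

    __global__ void child() { }

    __global__ void parent() {
        child<<<1, 1>>>();
        cudaError_t err = cudaGetLastError();   // status of the nested launch
        if (err != cudaSuccess)
            printf("child launch failed: %s\n", cudaGetErrorString(err));
    }

    int main() {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();
        printf("host status: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }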
3
votes
1 answer

What factors affect the overhead of dynamic parallelism kernel launches?

When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What are the factors contributing to or affecting the amount of this overhead? E.g. the size of the kernel code, occupancy of the SM where the kernel is being…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
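One way to put numbers on this is a micro-benchmark whose parent grid does nothing but launch empty children; a hedged sketch, with the launch count kept below the device runtime's default pending-launch limit of 2048:

    #include <cstdio>

    __global__ void emptyChild() { }

    __global__ void parent(int launches) {
        // One thread per block fires 'launches' fire-and-forget children.
        if (threadIdx.x == 0)
            for (int i = 0; i < launches; ++i)
                emptyChild<<<1, 1>>>();
    }

    int main() {
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);
        cudaEventRecord(beg);
        parent<<<16, 32>>>(100);            // 1600 device-side launches
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms;
        cudaEventElapsedTime(&ms, beg, end);
        printf("%.3f ms total, %.3f us per launch\n", ms, ms * 1000.0f / 1600);
        return 0;
    }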
3
votes
1 answer

How to perform relational join on two data containers on GPU (preferably CUDA)?

What I'm trying to do: on the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I want to perform an Inner Join. Imagine two…
aiwyn
  • 268
  • 2
  • 9
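One common shape for this with dynamic parallelism, sketched with illustrative names and sizes (none taken from the question): one parent thread per row of the left table launches a child grid that scans the right table and appends matching index pairs through an atomic cursor:

    #include <cstdio>

    // Child: scan all of S for one key of R, recording (left, right) matches.
    __global__ void scanRight(const int *sKeys, int nS, int rKey, int rIdx,
                              int2 *out, int *outCount) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < nS && sKeys[j] == rKey) {
            int slot = atomicAdd(outCount, 1);   // reserve an output slot
            out[slot] = make_int2(rIdx, j);
        }
    }

    // Parent: one thread per row of R launches one child scan of S.
    __global__ void joinParent(const int *rKeys, int nR,
                               const int *sKeys, int nS,
                               int2 *out, int *outCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nR) {
            int threads = 256;
            int blocks = (nS + threads - 1) / threads;
            scanRight<<<blocks, threads>>>(sKeys, nS, rKeys[i], i,
                                           out, outCount);
        }
    }

    int main() {
        const int nR = 4, nS = 6;
        int hR[nR] = {1, 2, 3, 4}, hS[nS] = {2, 4, 4, 5, 1, 2};
        int *dR, *dS, *dCount;
        int2 *dOut;
        cudaMalloc(&dR, sizeof(hR));
        cudaMalloc(&dS, sizeof(hS));
        cudaMalloc(&dOut, nR * nS * sizeof(int2));
        cudaMalloc(&dCount, sizeof(int));
        cudaMemcpy(dR, hR, sizeof(hR), cudaMemcpyHostToDevice);
        cudaMemcpy(dS, hS, sizeof(hS), cudaMemcpyHostToDevice);
        cudaMemset(dCount, 0, sizeof(int));
        joinParent<<<1, nR>>>(dR, nR, dS, nS, dOut, dCount);
        cudaDeviceSynchronize();
        int matches;
        cudaMemcpy(&matches, dCount, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d matching pairs\n", matches);
        return 0;
    }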
3
votes
1 answer

CUDA dynamic parallelism with Driver API

I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110. I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, sm_35 into a PTX file, and then the CUDA linker adds…
FHoenig
  • 349
  • 1
  • 10
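For reference, the driver-API side of this as a hedged sketch: the relocatable PTX has to be JIT-linked together with the cudadevrt library before the module is loaded. File, library, and kernel names are hypothetical, and error checking is elided:

    #include <cuda.h>

    int main() {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        CUlinkState link;
        cuLinkCreate(0, NULL, NULL, &link);
        // Device runtime library plus the relocatable PTX produced by nvcc.
        cuLinkAddFile(link, CU_JIT_INPUT_LIBRARY, "cudadevrt.lib",
                      0, NULL, NULL);
        cuLinkAddFile(link, CU_JIT_INPUT_PTX, "dyn_kernel.ptx",
                      0, NULL, NULL);

        void *cubin;
        size_t cubinSize;
        cuLinkComplete(link, &cubin, &cubinSize);

        CUmodule mod;
        cuModuleLoadData(&mod, cubin);           // load the linked image
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "parentKernel");
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, 0, NULL, NULL);
        cuCtxSynchronize();

        cuLinkDestroy(link);
        cuCtxDestroy(ctx);
        return 0;
    }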
2
votes
1 answer

"unknown error" on first cudaMalloc if CUBLAS is present in kernel

I have the following minimal .cu file #include #include #include __global__ void test() { cublasHandle_t handle = nullptr; cublasCreate(&handle); } int main(int, char**) { void * data =…
Joe
  • 6,497
  • 4
  • 29
  • 55
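A note on this one: kernel-side cuBLAS was only ever provided by the separate cublas_device library, which required relocatable device code and was removed in CUDA 10. A historical sketch, assuming such an older toolkit (build line in the comment):

    // Kernel-side cuBLAS, CUDA 9.x and earlier only.
    // Assumed build: nvcc -arch=sm_35 -rdc=true t.cu -lcublas_device -lcudadevrt
    #include <cublas_v2.h>

    __global__ void test() {
        cublasHandle_t handle = nullptr;
        if (cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS)
            cublasDestroy(handle);   // clean up the device-side handle
    }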
2
votes
1 answer

CUDA - How to make a thread in a kernel wait for its children

I'm trying to implement a really simple merge sort using CUDA's recursive kernel launches (compute capability 3.5 and up), but I cannot find a way to tell the parent thread to launch its children concurrently and then wait for its children's computation, since…
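For the record, the classic answer here was device-side cudaDeviceSynchronize(); a minimal sketch. Note this call was deprecated in device code in CUDA 11.6 and removed in the CUDA 12 dynamic-parallelism model, where tail launches replace it; kernel names and sizes below are illustrative:

    __global__ void sortHalf(int *data, int n) {
        // placeholder for the recursive half-sort
    }

    __global__ void mergeSortParent(int *data, int n) {
        if (threadIdx.x == 0) {
            // Children launched into the same (default) stream would
            // serialize; separate device-side streams let them overlap.
            cudaStream_t s1, s2;
            cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
            cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
            sortHalf<<<1, 32, 0, s1>>>(data, n / 2);
            sortHalf<<<1, 32, 0, s2>>>(data + n / 2, n - n / 2);
            cudaDeviceSynchronize();   // legacy CDP1: wait for all children
            cudaStreamDestroy(s1);
            cudaStreamDestroy(s2);
            // ... merge the two sorted halves here ...
        }
    }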
2
votes
0 answers

Does nvcc support tail call optimization in dynamic parallelism?

Under the CUDA Programming Guide section C.4.3.1.2, "Nesting and Synchronization Depth", it is mentioned: "An optimization is permitted where the system detects that it need not reserve space for the parent's state in cases where the parent kernel…
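The optimization the quote describes hinges on whether the parent ever synchronizes on its children; a hedged illustration of the two shapes in legacy (pre-CUDA-12) dynamic parallelism:

    __global__ void child() { }

    // Fire-and-forget: the parent never synchronizes on the child, so the
    // system need not reserve backing store for suspending the parent.
    __global__ void parentNoSync() {
        child<<<1, 1>>>();
    }

    // Explicit sync: the parent may be suspended at the sync point, so
    // saved-state space up to the configured sync depth must be reserved.
    __global__ void parentWithSync() {
        child<<<1, 1>>>();
        cudaDeviceSynchronize();   // legacy CDP1 device-side sync
    }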
2
votes
1 answer

Dynamic parallelism - launching many small kernels is very slow

I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common to its block. What I want to do is to first launch a coarse (or low-resolution) kernel,…
labotsirc
  • 722
  • 7
  • 21
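The coarse-to-fine pattern the question describes, as a hedged sketch with a stand-in refinement test: each coarse block makes one launch decision and fires at most one reasonably sized child, rather than one tiny child per thread:

    __global__ void fineKernel(float *data, int offset, int len) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len)
            data[offset + i] *= 2.0f;   // stand-in high-resolution work
    }

    __global__ void coarseKernel(float *data, const int *needsRefine,
                                 int blockLen) {
        // One launch decision per coarse block, not per thread.
        if (threadIdx.x == 0 && needsRefine[blockIdx.x])
            fineKernel<<<(blockLen + 255) / 256, 256>>>(
                data, blockIdx.x * blockLen, blockLen);
    }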
2
votes
3 answers

Kepler CUDA dynamic parallelism and thread divergence

There is very little information on Kepler's dynamic parallelism. From the description of this new technology, does it mean the issue of thread control-flow divergence within the same warp is solved? It allows recursion and launching kernels from device…
HooYao
  • 554
  • 5
  • 19
1
vote
1 answer

Why can't I link to my CUDA static library that uses Dynamic Parallelism and Separable Compilation?

I'm trying to create the most basic CUDA application demonstrating Dynamic Parallelism, Separate Compilation and Linking, and a CUDA kernel in a static library, and I'm trying to use CMake to generate a Visual Studio solution. I'm using CMake 3.21.3,…
Justin
  • 1,881
  • 4
  • 20
  • 40
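For comparison, a hedged sketch of the CMake shape that typically makes this link succeed with modern CMake; target and file names are hypothetical, and the key property is CUDA_RESOLVE_DEVICE_SYMBOLS, which forces the device-link step that dynamic parallelism needs:

    # Sketch: static CUDA library with dynamic parallelism, plus a consumer.
    cmake_minimum_required(VERSION 3.18)
    project(dp_demo LANGUAGES CXX CUDA)
    find_package(CUDAToolkit REQUIRED)

    add_library(dp_kernels STATIC kernels.cu)
    set_target_properties(dp_kernels PROPERTIES
        CUDA_SEPARABLE_COMPILATION ON
        CUDA_ARCHITECTURES 52)

    add_executable(dp_app main.cu)
    set_target_properties(dp_app PROPERTIES
        CUDA_SEPARABLE_COMPILATION ON
        CUDA_RESOLVE_DEVICE_SYMBOLS ON
        CUDA_ARCHITECTURES 52)
    target_link_libraries(dp_app PRIVATE dp_kernels CUDA::cudadevrt)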
1
vote
1 answer

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel code that produces the error: __kernel void testDynamic(__global int *data) { int id=get_global_id(0); atomic_add(&data[1],2); } __kernel void test(__global int * data) { int id=get_global_id(0); atomic_add(&data[0],2); …
huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97