cudaErrorIllegalInstruction on recursive function call

Question

I am writing a path tracer for GPU using CUDA 10.2. The entire program ran fine until i added a recursive call to the trace function. nvcc still compiles it, although with the warning: "Severity Code Description Project File Line Suppression State Warning Stack size for entry function '' cannot be statically determined". When the GPU reaches the point it stops and the next time CPU gets an cudaError from an API call it is cuda error 715, which is cudaErrorIllegalInstruction. I tried recreating the issue by writing another recursive kernel/function pair, and the compiler gave the same warning, but it executed expectedly. Unfortunately this means i have to dump my entire function here (if there are any questions to the functions and types used i will happily answer them):

__device__ Vec3 trace(
    const Settings& settings,
    const Ray& r,
    const Shape* shapes,
    const size_t nshapes,
    uint8_t bounces,
    curandState& randState) {

    if (bounces >= settings.maxBounces) {
        return Vec3(0.0f);
    }

    const Shape* shape = nullptr;
    float t = inf;
    bool flipNormal;

    float dist;

    for (size_t i = 0; i < nshapes; i++) {
        if (shapes[i].intersect(r, dist, flipNormal) && dist < t) {
            shape = shapes + i;
            t = dist;
        }
    }

    if (shape == nullptr) 
        return settings.background;

    const Vec3 hitPos = r.ori + t * r.dir;
    const Vec3 normal = flipNormal ? -shape->normal(hitPos) : shape->normal(hitPos);

    const Vec3 hemiDir = cosineSample(normal, randState);

    const Vec3 traceCol = trace(
        settings,
        Ray(hitPos + normal * settings.bias, hemiDir),
        shapes,
        nshapes,
        bounces + 1,
        randState
        );

    return shape->surface.emittance + shape->surface.color * traceCol;

}

Has anyone else had this issue and in that case, how was it fixed? I could probably redesign to a non-recursive design, although it wouldn't be an optimal solution. I don't even know where to start with debugging this issue, so any ideas are greatly appreciated.

Are you sure you are not just running out of runtime heap space? — talonmies, Mar 26 '20 at 16:36
Its possible, but would the error not say something like that then? Besides, i tried swapping `bounces + 1` with `settings.maxBounces`, so the function would only run twice, and i still got the same error. — Tegon McCloud, Mar 26 '20 at 16:47
Recursion will have limits. Most recursive algorithms can be realized non-recursively. To continue debug of the illegal instruction error you could try the method described [here](https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218). That might help to localize the error. However if you are running out of stack space (a possibility in recursion) it will probably be difficult to "debug". You might try increasing [the stack size](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#configuration-options) — Robert Crovella, Mar 26 '20 at 17:10
Questions seeking debugging assistance are expected to include a [mcve] — Robert Crovella, Mar 26 '20 at 17:25
Thank you for the suggestion. The reason that i did not include a MRE is that i am unable to reproduce the error in other situations than this, effectively making the included code my MRE. — Tegon McCloud, Mar 26 '20 at 17:37
I can't reproduce your observation based on what you have provided. Your posted code doesn't compile, for example. The point here is that a MRE should be a **complete** code, among other things. Anyway I've voted to close for this reason. If you read the link you'll get an idea of the differences between the stated expectations and what you have provided. — Robert Crovella, Mar 26 '20 at 17:50
Interestingly, running the program with cuda-memcheck and through vs gave different errors. cuda-memcheck resulted in a cudaErrorLaunchFailure with the message "unspecified launch failure", so unfortunately I cannot obtain a stacktrace in that way. ): — Tegon McCloud, Mar 26 '20 at 17:51
I guess you are correct about the MRE, but would you rather have that i just dump thousands of lines of code in a SO question? — Tegon McCloud, Mar 26 '20 at 17:53
Yes, it is a challenging requirement. Now that you have understood the concept behind complete, the next thing to tackle is minimal. This is also challenging. I'm not suggesting the effort to create a MRE is a trivial matter. Almost certainly it does not require 10,000 lines of code to demonstrate the issue (with a complete code) but reducing from 10,000 lines of code to something smaller will certainly require effort on your part. And, yes, I for one would rather have 10,000 lines that is a complete code, than something that is not. However I doubt this is a universally held opinion on SO. — Robert Crovella, Mar 26 '20 at 17:57
Supposed I asked you to debug a code but said "you're not allowed to use any tools. Just your eyes and brain. Furthermore, the problem might not be in the code I'm showing you." Then, in another example, I said, "You can use any tools you want - the compiler, debugger, runtime checkers, static analyzers, whatever. And the problem is definitely in the code I've given you." Which method do you think is more powerful or that you would prefer? I would prefer the latter. The latter is enabled by provide a **complete** code. Do as you wish, of course. Just sharing my viewpoint. — Robert Crovella, Mar 26 '20 at 17:59
The problem with 10k lines of code is that it is unlikely that the question is going to help anyone else but the asker. However, the sole process of narrowing the code from 10k to something more manageable usually helps finding the bug. The function above seems to be short enough, do you think it would be possible to put it in a separate project and fill in the blanks with something simple so that it can be compiled? — CygnusX1, Mar 26 '20 at 18:16
Yes, and now i did, the problem is 100% stacksize limitations and it is just a really bad error message. When i tried to create a MRE i just underestimated how big the numbers had to be for CUDA to complain, but i made another much smaller program have the same issue. I can edit the question for future people having similar issues, although im fairly confident i can fix it myself now. — Tegon McCloud, Mar 26 '20 at 18:24

score 0 · Accepted Answer · answered Mar 26 '20 at 18:42

The problem is that CUDA usually selects a fitting max stack size for a kernel call, but it is unable to because nvcc cannot predict the necessary size for a recursive functions.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>

#include <stdint.h>

__device__ int recurse(uint64_t n, uint64_t max) {
    if (n < max)
        return recurse(n + 1, max);
    else 
        return n;
}

__global__ void start(uint64_t max) {
    uint32_t idx = threadIdx.x + (blockIdx.x * blockDim.x);

    if(idx == 256 * 256 - 1)
        printf("%i: %i\n", idx, recurse(0, max));

    return;
}

int main() {

    cudaError_t status;

    status = cudaSetDevice(0);
    if (status != cudaSuccess) {
        std::cerr << "failed: " << cudaGetErrorString(status) << std::endl;
        return status;
    }

    cudaThreadSetLimit(cudaLimitStackSize, 2048);

    start<<<256, 256>>>(126);

    status = cudaDeviceSynchronize();
    if (status != cudaSuccess) {
        std::cerr << "failed: " << cudaGetErrorString(status) << std::endl;
        return status;
    }

    return 0;

}

This program will run, but if 2048 is replaced with 1024, it will output the cudaErrorIllegalInstruction.

`cudaThreadSetLimit` is [deprecated](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__THREAD__DEPRECATED.html). If you follow [the link I gave you](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#configuration-options) to the current documentation, you will find the updated API call. — Robert Crovella, Mar 26 '20 at 18:54

cudaErrorIllegalInstruction on recursive function call

1 Answers1