0

In ordinary C++, if I say the following, it is safe because the third clause will be skipped execution. I'm just wondering if cuda kernel code will also have this property or it doesn't in consideration of maximizing parallelism?

int x[100] = {...}, i = -1;
if (i < 0 || i >= 100 || x[i] == 0) {
  // do something.
}

EDIT:

Taking from Jack's program, the following program runs OK and output "10". There is no error doing cuda-memcheck.

#include <stdio.h>

__global__ void test(float *input, float *output, int i, int N) {
    float c = 10;

    // NOTE: uncomment this will cause cuda-memcheck to give an error.
    // c = input[-1];

    if (i < 0 || i >= N || (c = input[-1])) {
        output[0] = c;
    }
}

int main(void) {

    int i = -1;
    int N = 10;

    float* input;
    float* output;
    float* dev_input;
    float* dev_output;

    input = (float*)malloc(sizeof(float) * N);
    output = (float*)malloc(sizeof(float));
    for (int j = 0; j < N; j++) {
      input[j] = 2.0f;
    }
    output[0] = 3.0f;

    cudaMalloc((void**)&dev_input,sizeof(float) * N);
    cudaMalloc((void**)&dev_output,sizeof(float));

    cudaMemcpy(dev_input,input,sizeof(float) * N,cudaMemcpyHostToDevice);
    cudaMemcpy(dev_output,output,sizeof(float),cudaMemcpyHostToDevice);

    test<<<1,1>>>(dev_input,dev_output,i,N);

    cudaMemcpy(output,dev_output,sizeof(float),cudaMemcpyDeviceToHost);

    printf("%f\n", output[0]);
    return 0;
}
shaoyl85
  • 1,854
  • 18
  • 30
  • Your disassembled code will look different than that provided by @JackOLantern, and it will contain the necessary sequence points to prevent the unconditional load he has observed. `input[-1]` is not loaded unconditionally in your case. – Robert Crovella Mar 04 '14 at 19:45
  • Then why is Jack's code `if (i < N || input[i] == 0)` loaded unconditionally? I can't figure out the difference? – shaoyl85 Mar 05 '14 at 04:20
  • There are two entry possibilities for Jack's code. Either `i` is less than `N`, or `i` is not less than `N`. Take each case, walk all the way through Jack's code, and ask yourself "will `input[i]` get loaded?" The answer is yes in both cases. Therefore the compiler can load it unconditionally. If `i` is not less than `N`, the second part of the if clause (`input[i] == 0`) gets evaluated, forcing a load of `input[i]`. If `i` is less than `N`, the remainder of the if statement `output[i] = input[i];` gets executed, forcing a load of `input[i]`. Therefore `input[i]` is loaded in each case. – Robert Crovella Mar 05 '14 at 04:53
  • Wow, that is interesting. I never thought the compilers today are this smart. – shaoyl85 Mar 05 '14 at 05:01

2 Answers2

1

Try the simple code below in which the kernel function tries to access input[-1]. You will realize it will get stuck.

#include <stdio.h>

__global__ void test(float *input, float *output, int i, int N) {

    if (i < N || input[i] == 0) {
        output[i] = input[i];
    }
}

void main(void) {

    int i = -1;
    int N = 10;

    float* input;
    float* dev_input;
    float* dev_output;

    input = (float*)malloc(sizeof(float));
    input[0] = 2.f;

    cudaMalloc((void**)&dev_input,sizeof(float));
    cudaMalloc((void**)&dev_output,sizeof(float));

    cudaMemcpy(dev_input,input,sizeof(float),cudaMemcpyHostToDevice);

    test<<<1,1>>>(dev_input,dev_output,i,N);
}

The reason can be explained by having a look at the disassembled code.

   MOV R1, c[0x1][0x100];                              R1 = c[0x1][0x100]
   NOP;
   MOV R3, c[0x0][0x28];                               R3 = c[0x0][0x28]
   SHL R2, R3, 0x2;                                    R2 = shiftleft(R3)           
   IADD R0, R2, c[0x0][0x20];                          R0 = R2 + 0x2
   LDU R0, [R0];                                       Load the memory addressed by R0 to R0
   FSETP.EQ.AND P0, PT, R0, RZ, PT;                    Predicate register P0 will contain result of test R0 == 0
   ISETP.LT.OR P0, PT, R3, c[0x0][0x2c], P0;           Predicate register P0 will contain result of test P0 || (R3 < c[0x0][0x2c])
@P0 IADD R2, R2, c[0x0][0x24];                         ...
@P0 ST [R2], R0;
   EXIT ;

As you can see, the device will attempt to load the data from global memory regardless from the result of the first clause.

Vitality
  • 20,705
  • 4
  • 108
  • 146
  • Thank you! I tried the program but the conclusion seems to be opposite, please see my edits. – shaoyl85 Mar 04 '14 at 19:06
  • I'm not sure what claims or assertions are being put forward by this answer or what answer it is providing to the question posed. However since it is pointed out that the global memory load occurs regardless, let's consider whether or not `input[i]` will be loaded if the first condition (`i < N`) is false. (It will.) Then consider whether `input[i]` will be loaded if the first condition is true. *It will also be loaded*, by the next line of the kernel. Therefore the compiler is allowed to generate the load unconditionally. – Robert Crovella Mar 04 '14 at 19:51
  • @RobertCrovella I was just trying to point out that performing operations like `if (i < 0 || i >= 100 || x[i] == 0) { // do something. }` as indicated by the OP isn't safe. From the disassembled code, I would conclude that `input[i]` is being loaded unconditionally which, from your comment, seems to be also your conclusion. Please, point out any wrong statement/conclusion or misinterpretation of the OP's query and I will amend/remove my post. – Vitality Mar 04 '14 at 21:29
  • I'm unable to see how your code demonstrates that `if (i < 0 || i >= 100 || x[i] == 0) { ` is unsafe. For the input of `i=-1`, your code is unsafe even if your if condition is `if (i < N) {`. I guess you are saying that you are answering a question "Is it safe?" which isn't really what the OP asked. The OP asked "Will CUDA evaluate certain clauses or not?" Furthermore, the OP seems to have interpreted your answer as an answer to OP's question (which it certainly is not), as evidenced by what appears to be confusion in OP's comments and posting edits ("the conclusion seems to be opposite"). – Robert Crovella Mar 04 '14 at 21:39
  • @RobertCrovella I have asked the SO team to remove this answer. – Vitality Mar 04 '14 at 21:49
  • Wait a minute, please don't delete the post before I understand it. I'm still struggling on understanding the answer and the comments here. :) – shaoyl85 Mar 05 '14 at 04:16
  • @JackOLantern is it that the load in your disassembled code is generated from the the body of the if statement rather than the clause in the condition? I mean from `output[i] = input[i];`? If so, then where is the conditional load assembled from the if condition? – shaoyl85 Mar 05 '14 at 04:26
  • The compiler doesn't need to generate two separate loads in this case. The compiler has observed that it will load it either way (no matter what) therefore it loads it only once. Optimization! – Robert Crovella Mar 05 '14 at 04:54
  • Nobody needs to delete any answers. The question would be more confusing now if this answer got deleted. – Robert Crovella Mar 05 '14 at 04:55
  • @shaoyl85 As noticed by @RobertCrovella, in the example above, the body of the `if` statement forced the load of `input[i]` in any case and triggered the compiler optimization. You can verify that in my code the unconditional load will disappear if `output[i] = input[i]` is changed to `output[i] = 3`. When writing the example yesterday, I didn't realize that I was involving the load to `input[i]` in the `if` statement body also. Mea culpa. – Vitality Mar 05 '14 at 06:31
1

The CUDA C/C++ compiler should obey the language requirements in this respect.

Specifically, the language requirements as far as order-of-operation and short-circuiting should be maintained for non-overloaded || and && operators.

Community
  • 1
  • 1
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257