I am trying to multiply matrix A (n times) with matrix B. I wrote a kernel for the matrix multiplication and use streams to run it N times. I have 3 conditions to test consecutively. The first condition runs successfully.

I don't know why it reports an "invalid argument" error in the iteration for the second condition. My guess is that I am not cleaning up memory properly. I have done my best to free all host and device variables and also tried a CUDA device reset, but nothing helps. Can anyone help me debug this?

Please find the portion of my code here:

int main(){
    
    
    for (int i = 0; i < 3; i++) {
        
      
      for (int ind = 0; ind < itr; ind++){
          cudaStreamCreate(&(stream[ind]));
      }
      cudaCheckErrors("cudaStreamCreate fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind]);
      }
      cudaDeviceSynchronize();

      for (int ind = 0; ind < itr; ind++){
          // Launch our kernel
          matrixMul<<<BLOCKS, THREADS, 0, stream[ind]>>>(d_a[ind], b, d_c[ind], M, K, N);
      }
      cudaDeviceSynchronize();
      cudaCheckErrors("kernel fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(h_c[ind], d_c[ind], bytes_c, cudaMemcpyDeviceToHost, stream[ind]);
      }

      for (int ind = 0; ind < itr; ind++){
          cudaStreamSynchronize(stream[ind]);
      }
        
      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );

      cudaEventDestroy( start );
      cudaEventDestroy( stop);

      // Free allocated memory ****The issue was here.******
      cudaFreeHost(h_a);
      cudaFree(b);
      cudaFreeHost(h_c);
      cudaFree(d_a);
      cudaFree(d_c);
      cudaDeviceReset();
    }

    return 0;
}

In the second iteration I was getting this error:

Fatal error: cudaStreamCreate fail (invalid argument at /tmp/tmpwgpzgk9m/73a7502c-7662-4e80-804e-4debff15dc45.cu:140)
*** FAILED - ABORTING

Solved:

The error was caused by a memory leak. I was allocating an array of pointers but freeing only the first one. As suggested in Robert's answer below, the memory has to be freed for each index of the array. Also, always use proper error checking in CUDA, as the comments and the answer below recommend.

Alankrit
  • You're doing absolutely no error checking while you're processing. You check none of the values returned from `malloc()`, for example. – Andrew Henle Nov 17 '20 at 19:42
  • I could not get, what you said @AndrewHenle. Could you please elaborate? I would really appreciate that. – Alankrit Nov 17 '20 at 19:45
  • I checked from the beginning of the next iteration of s. It shows the error right from the start, but I can't figure out why. Is there a memory leak? – Alankrit Nov 17 '20 at 19:58
  • 1. Implement [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). Use it on **every** cuda call. 2. run your code with `cuda-memcheck`. There are a variety of issues with your code. I'll pick just one: you don't use `cudaFreeHost` on a pointer returned by `malloc`, or on a pointer that is actually a stack array. Your usage of `cudaFree` is also incorrect, and so you have lots of memory leaking going on. If you are doing allocations in a loop, you are likely going to have to free in a loop. – Robert Crovella Nov 17 '20 at 20:09
  • Hi @RobertCrovella, How can I free a loop? I am sorry, I am just learning the language. Please help. – Alankrit Nov 17 '20 at 20:28
  • Read my answer. – Robert Crovella Nov 17 '20 at 20:31

1 Answer

Suggestion: Implement proper CUDA error checking. Use it on every CUDA call. Your haphazard use of the error checking macro produces confusing output that seems to suggest a problem with stream creation.

That is not the case. The invalid argument is arising from your freeing operations at the end of the loop. You have a number of errors:

  1. You don't use cudaFreeHost on a pointer returned by malloc, or on a pointer that is actually a stack array.
  2. You don't use cudaFree on a pointer that is actually a stack array.
  3. If you have done allocations in a loop, you are likely going to have to do free operations in a loop.
  4. Even with your use of cudaDeviceReset (which frees all device allocations anyway), you have a memory leak because of improper freeing of the malloc allocations.
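
To illustrate the suggestion about error checking: when each runtime call is checked at the point where it is made, a failing free is reported at its own line instead of surfacing later at an unrelated check. The following is only a minimal sketch. `CUDA_CHECK` and `checked_roundtrip` are hypothetical names, not part of the question's code, and the buffer size is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: evaluates one runtime call and aborts with
// the call text, error string, file, and line if it did not succeed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s at %s:%d\n", #call,        \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical example routine: allocate, clear, and free one device
// buffer, checking every call. Returns true if it reaches the end.
bool checked_roundtrip(size_t n) {
    float *d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_buf, 0, n * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));  // a bad pointer here would be reported here
    return true;
}

int main() {
    if (checked_roundtrip(256)) {
        printf("all calls succeeded\n");
    }
    return 0;
}
```

Wrapping the `cudaFreeHost` and `cudaFree` calls from the question in a macro like this would presumably have pointed at the freeing code directly rather than at `cudaStreamCreate`.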

By modifying the end of your code as follows:

  ...
  cudaEventDestroy( start );
  cudaEventDestroy( stop);

  for (int ind = 0; ind < itr; ind++){
      free(h_a[ind]);
      free(h_c[ind]);
      cudaFree(d_a[ind]);
      cudaFree(d_c[ind]);
  }
  // Free allocated memory
  cudaFree(b);
  cudaDeviceReset();
}
...

I was able to make the above errors disappear.
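
For reference, here is a self-contained sketch of the pairing rule behind that change: whatever is allocated per index is freed per index, `malloc` is matched by `free`, and `cudaMalloc` is matched by `cudaFree`. The names `h_a`, `d_a`, and `itr` mirror the question, but the allocation code, the sizes, and the function `run_once` are assumptions, since the question does not show how its buffers were allocated.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical routine: one allocate / copy / free cycle over `itr` buffers.
// Returns true only if every allocation, copy, and free succeeded.
bool run_once(int itr, size_t bytes) {
    std::vector<float*> h_a(itr, nullptr);
    std::vector<float*> d_a(itr, nullptr);
    bool ok = true;

    // Allocate one host and one device buffer per index.
    for (int ind = 0; ind < itr; ind++) {
        h_a[ind] = static_cast<float*>(malloc(bytes));
        if (h_a[ind] == nullptr) { ok = false; continue; }
        memset(h_a[ind], 0, bytes);
        if (cudaMalloc(&d_a[ind], bytes) != cudaSuccess) ok = false;
    }

    // Use the buffers (only when everything was allocated).
    for (int ind = 0; ok && ind < itr; ind++) {
        if (cudaMemcpy(d_a[ind], h_a[ind], bytes,
                       cudaMemcpyHostToDevice) != cudaSuccess) ok = false;
    }

    // Free per index, with the function that matches each allocator.
    for (int ind = 0; ind < itr; ind++) {
        free(h_a[ind]);                                   // malloc -> free
        if (cudaFree(d_a[ind]) != cudaSuccess) ok = false; // cudaMalloc -> cudaFree
    }
    return ok;
}

int main() {
    // Repeat the cycle three times, like the outer loop in the question.
    for (int i = 0; i < 3; i++) {
        printf("iteration %d: %s\n", i,
               run_once(8, 1024 * sizeof(float)) ? "ok" : "failed");
    }
    return 0;
}
```

Both `free(nullptr)` and `cudaFree(nullptr)` are harmless, so the cleanup loop is safe even when an allocation failed partway through.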

As an aside, it should not be necessary to create 5000 streams, but it appears to work so I'll leave it at that. I would normally advise stream reuse.

Stream reuse could look something like this. Instead of creating 5000 streams, pick a smaller number, like 5. The exact number shouldn't matter much here; it's likely that anything in the range of 3 or greater will behave similarly.

  1. Create that many streams:

       const int max_streams = 5;
       for (int ind = 0; ind < max_streams; ind++){
           cudaStreamCreate(&(stream[ind]));
       }
    
  2. When it comes to using the streams, use modulo arithmetic to "rotate" through the streams:

     for (int ind = 0; ind < itr; ind++){
         cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind%max_streams]);
     }
     cudaDeviceSynchronize();

     for (int ind = 0; ind < itr; ind++){
         // Launch our kernel
         matrixMul<<<BLOCKS, THREADS, 0, stream[ind%max_streams]>>>(d_a[ind], b, d_c[ind], M, K, N);
     }
     cudaDeviceSynchronize();
    ...
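
The rotation idea can also be shown as a complete program. This is a sketch under stated assumptions, not the code from the question: the kernel `addOne` stands in for `matrixMul`, the function `run_with_stream_pool` is a hypothetical name, and the host buffers are pinned with `cudaMallocHost` (the question's host buffers appear to come from `malloc`). It also differs from the snippet above in one respect: copy-in, kernel, and copy-out for a given task are issued into the same stream, so in-stream ordering replaces the `cudaDeviceSynchronize()` calls between phases.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): adds 1 to every element.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical routine: runs `itr` independent tasks over a pool of
// `max_streams` streams and returns true if all results are correct.
bool run_with_stream_pool(int itr, int max_streams, int n) {
    size_t bytes = n * sizeof(float);
    std::vector<cudaStream_t> stream(max_streams);
    for (auto &s : stream) cudaStreamCreate(&s);

    std::vector<float*> h(itr, nullptr), d(itr, nullptr);
    for (int ind = 0; ind < itr; ind++) {
        cudaMallocHost(&h[ind], bytes);  // pinned host memory
        cudaMalloc(&d[ind], bytes);
        for (int j = 0; j < n; j++) h[ind][j] = static_cast<float>(ind);
    }

    // Rotate through the pool with modulo arithmetic.
    for (int ind = 0; ind < itr; ind++) {
        cudaStream_t s = stream[ind % max_streams];
        cudaMemcpyAsync(d[ind], h[ind], bytes, cudaMemcpyHostToDevice, s);
        addOne<<<(n + 255) / 256, 256, 0, s>>>(d[ind], n);
        cudaMemcpyAsync(h[ind], d[ind], bytes, cudaMemcpyDeviceToHost, s);
    }
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;

    // Task `ind` started with the value ind, so it should now hold ind + 1.
    for (int ind = 0; ind < itr; ind++)
        for (int j = 0; j < n; j++)
            if (h[ind][j] != static_cast<float>(ind) + 1.0f) ok = false;

    // Free per index, then destroy each stream exactly once.
    for (int ind = 0; ind < itr; ind++) {
        cudaFreeHost(h[ind]);
        cudaFree(d[ind]);
    }
    for (auto &s : stream) cudaStreamDestroy(s);
    return ok;
}

int main() {
    printf("%s\n", run_with_stream_pool(50, 5, 1000) ? "results match"
                                                     : "mismatch or error");
    return 0;
}
```

Note that the streams are created once and destroyed once, independent of `itr`, which is the point of the pool.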
    
Robert Crovella
  • The freeing of memory in a loop actually worked. Thanks a lot @RobertCrovella :). I have been stuck here for hours. Thank you very much. – Alankrit Nov 17 '20 at 20:37
  • 1) Do you mind advising me how to reuse streams in my case? 2)And can I convert my init_matrix() function to a kernel? If so... would rand() function works inside a kernel? – Alankrit Nov 17 '20 at 20:40
  • I've added a section to talk about stream reuse to my answer. No, `rand()` cannot be used like that in kernel code. I'm unlikely to respond to further unrelated questions here. – Robert Crovella Nov 18 '20 at 01:53
  • Thank you so much Robert for your kind help. I have learned a lot about streams and cuda from this. I am sure this thread will help future learners/readers as well. – Alankrit Nov 18 '20 at 06:09