
I have a very simple Scala JCuda program that adds two very large arrays. Everything compiles and runs fine until I try to copy more than 4 bytes from the device to the host, at which point I get CUDA_ERROR_INVALID_VALUE.

// This fails with CUDA_ERROR_INVALID_VALUE
var hostOutput = new Array[Int](numElements)
cuMemcpyDtoH(
  Pointer.to(hostOutput),
  deviceOutput,
  8
)

// This runs just fine
var hostOutput = new Array[Int](numElements)
cuMemcpyDtoH(
  Pointer.to(hostOutput),
  deviceOutput,
  4
)

To give better context for the actual program, below is my kernel code, which compiles and runs just fine:

extern "C"
__global__ void add(int n, int *a, int *b, int *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<n)
    {
        sum[i] = a[i] + b[i];
    }
}

I translated some Java sample code into Scala. Below is the entire program:

package dev

import jcuda.driver.JCudaDriver._

import jcuda._
import jcuda.driver._
import jcuda.runtime._

/**
 * Created by dev on 6/7/15.
 */
object TestCuda {
  def init = {
    JCudaDriver.setExceptionsEnabled(true)

    // Input vector

    // Output vector

    // Load module
    // Load the ptx file.

    val kernelPath = "/home/dev/IdeaProjects/jniopencl/src/main/resources/kernels/JCudaVectorAddKernel30.cubin"

    cuInit(0)

    val device = new CUdevice
    cuDeviceGet(device, 0)
    val context = new CUcontext
    cuCtxCreate(context, 0, device)

    // Create and load module
    val module = new CUmodule()
    cuModuleLoad(module, kernelPath)

    // Obtain a function pointer to the kernel function.
    var add = new CUfunction()
    cuModuleGetFunction(add, module, "add")

    val numElements = 100000

    val hostInputA = (1 to numElements).toArray
    val hostInputB = (1 to numElements).toArray
    val SI: Int = Sizeof.INT.asInstanceOf[Int]

    // Allocate the device input data, and copy
    // the host input data to the device
    var deviceInputA = new CUdeviceptr
    cuMemAlloc(deviceInputA, numElements * SI)
    cuMemcpyHtoD(
      deviceInputA,
      Pointer.to(hostInputA),
      numElements * SI
    )

    var deviceInputB = new CUdeviceptr
    cuMemAlloc(deviceInputB, numElements * SI)
    cuMemcpyHtoD(
      deviceInputB,
      Pointer.to(hostInputB),
      numElements * SI
    )

    // Allocate device output memory
    val deviceOutput = new CUdeviceptr()
    cuMemAlloc(deviceOutput, SI)

    // Set up the kernel parameters: A pointer to an array
    // of pointers which point to the actual values.
    val kernelParameters = Pointer.to(
      Pointer.to(Array[Int](numElements)),
      Pointer.to(deviceInputA),
      Pointer.to(deviceInputB),
      Pointer.to(deviceOutput)
    )

    // Call the kernel function
    val blockSizeX = 256
    val gridSizeX = Math.ceil(numElements / blockSizeX).asInstanceOf[Int]
    cuLaunchKernel(
      add,
      gridSizeX, 1, 1,
      blockSizeX, 1, 1,
      0, null,
      kernelParameters, null
    )

    cuCtxSynchronize

    // **** This call fails with CUDA_ERROR_INVALID_VALUE.
    // If I comment it out, the program runs fine.
    var hostOutput = new Array[Int](numElements)
    cuMemcpyDtoH(
      Pointer.to(hostOutput),
      deviceOutput,
      numElements
    )

    hostOutput.foreach(print(_))
  }
}

For reference, here are the specs of my computer: I'm running Ubuntu 14.04 on an Optimus setup with a GTX 770M card, which is compute 3.0 capable. I'm using NVCC version 5.5 and Scala 2.11.6 with Java 8. I'm a noob and would greatly appreciate any help.

Dr.Knowitall
  • I don't know how to do [CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api/14038590#14038590) in jcuda, but I guess you should check errors after each CUDA API call to make sure that you identify the right spot where the error really occurs – m.s. Jun 12 '15 at 07:23
  • @m.s. The basic error checks (i.e. the function return values) are performed automatically once `setExceptionsEnabled(true)` has been set. – Marco13 Jun 12 '15 at 08:07

1 Answer


Here

val deviceOutput = new CUdeviceptr()
cuMemAlloc(deviceOutput, SI)

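you are allocating SI bytes, which is 4 bytes, the size of one int. The kernel writes numElements ints through this device pointer, so writing more than 4 bytes to it will mess things up, and a copy of more than 4 bytes asks for more memory than was allocated. It should be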

cuMemAlloc(deviceOutput, SI * numElements)

And similarly, I think that the call in question should be

cuMemcpyDtoH(
  Pointer.to(hostOutput),
  deviceOutput,
  numElements * SI
)

(note the * SI for the last parameter).
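
To make the byte arithmetic explicit, here is a minimal sketch in plain Scala. The object `CopySizes` and its methods are hypothetical helpers, not part of JCuda; the only assumption is that an int is 4 bytes, which is the value `Sizeof.INT` holds. The `gridSize` helper is a side observation that is not part of the original answer: `Math.ceil(numElements / blockSizeX)` divides two Ints before `ceil` is applied, so it appears to round down rather than up.

```scala
// Hypothetical helpers, not part of JCuda. They only do the size arithmetic.
object CopySizes {
  // Bytes per Int; assumed to match jcuda.Sizeof.INT (4).
  val IntBytes: Int = 4

  // Byte count for numElements Ints. The size arguments of cuMemAlloc,
  // cuMemcpyHtoD and cuMemcpyDtoH are byte counts, not element counts.
  def byteCount(numElements: Int): Long = numElements.toLong * IntBytes

  // Number of blocks needed so that gridSize * blockSize >= numElements,
  // rounding up with integer arithmetic only.
  def gridSize(numElements: Int, blockSize: Int): Int =
    (numElements + blockSize - 1) / blockSize
}
```

With the values from the question, `CopySizes.byteCount(100000)` is 400000, the size to pass to both `cuMemAlloc` and `cuMemcpyDtoH` for the output buffer, and `CopySizes.gridSize(100000, 256)` is 391, whereas the integer division in the original program yields 390.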

Marco13
  • Ah, thank you so much!!! It's too easy to overlook these things. I was going to start using cuda-gdb, but is there a good way to debug native host code as well? – Dr.Knowitall Jun 12 '15 at 16:20
  • @Mr.Student In fact, the most important thing at the API level is `setExceptionsEnabled(true)`. In this case, it already pointed you to the right line, and carefully examining the parameters would have revealed the error (I'm not sure how far a tool could help beyond that). Debugging kernels is a different story, and unfortunately much harder with Java/JCuda than with the tools available for native CUDA (NVIDIA Nsight). Also see http://www.jcuda.org/debugging/Debugging.html – Marco13 Jun 12 '15 at 16:23
  • Thanks, I appreciate all your help, Marco. – Dr.Knowitall Jun 13 '15 at 06:11