cudaMemcpyToSymbol vs cudaMemcpy

Question

I'm trying to figure out why cudaMemcpyToSymbol is not working for me. (But cudaMemcpy does.)

// symbols:
__constant__ float flt[480];   // 1920 bytes
__constant__ int   ints[160];  // 640 bytes

// func code follows:
float* pFlts;
cudaMalloc((void**)&pFlts, 1920+640);  // chunk of gpu mem  (floats & ints)

// This does NOT work properly:
cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice);  // first copy
cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice);  // second copy

The second copy is trashing the contents of the first copy (flt), and the second copy does not happen. (If I remove the second copy, the first copy works fine.)

Results:

GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320)  TotThrds=1   ** Source of 1st copy
  0x500500500: float[320]= 1.000
  0x500500504: float[321]= 0.866
  0x500500508: float[322]= 0.500
  0x50050050c: float[323]= -0.000
  0x500500510: float[324]= -0.500
  0x500500514: float[325]= -0.866
  0x500500518: float[326]= -1.000
  0x50050051c: float[327]= -0.866
  0x500500520: float[328]= -0.500
  0x500500524: float[329]= 0.000
  0x500500528: float[330]= 0.500
  0x50050052c: float[331]= 0.866
  0x500500530: float[332]= 1.000
  GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320)  TotThrds=1     ** Dest of 1st copy
  0x500100f98: float[320]= 0.000
  0x500100f9c: float[321]= 0.500
  0x500100fa0: float[322]= 0.866
  0x500100fa4: float[323]= 1.000
  0x500100fa8: float[324]= 0.866
  0x500100fac: float[325]= 0.500
  0x500100fb0: float[326]= -0.000
  0x500100fb4: float[327]= -0.500
  0x500100fb8: float[328]= -0.866
  0x500100fbc: float[329]= -1.000
  0x500100fc0: float[330]= -0.866
  0x500100fc4: float[331]= -0.500
  0x500100fc8: float[332]= 0.000
  GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0)  TotThrds=1      ** Source of 2nd copy
  0x500500780: int[0]= 1
  0x500500784: int[1]= 1
  0x500500788: int[2]= 1
  0x50050078c: int[3]= 1
  0x500500790: int[4]= 1
  0x500500794: int[5]= 1
  0x500500798: int[6]= 1
  0x50050079c: int[7]= 1
  0x5005007a0: int[8]= 1
  0x5005007a4: int[9]= 1
  0x5005007a8: int[10]= 1
  0x5005007ac: int[11]= 1
  0x5005007b0: int[12]= 0
  GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0)  TotThrds=1      ** Dest of 2nd copy
  0x500100818: int[0]= 0
  0x50010081c: int[1]= 0
  0x500100820: int[2]= 0
  0x500100824: int[3]= 0
  0x500100828: int[4]= 0
  0x50010082c: int[5]= 0
  0x500100830: int[6]= 0
  0x500100834: int[7]= 0
  0x500100838: int[8]= 0
  0x50010083c: int[9]= 0
  0x500100840: int[10]= 0
  0x500100844: int[11]= 0
  0x500100848: int[12]= 0

The following works properly:

cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice);  // first copy
int* pTemp;
cudaGetSymbolAddress((void**) &pTemp, ints);
cudaMemcpy(ints,pFlts+480,640,cudaMemcpyDeviceToDevice);  // second copy

Results:

  GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320)  TotThrds=1   ** Source of first copy
  0x500500500: float[320]= 1.000
  0x500500504: float[321]= 0.866
  0x500500508: float[322]= 0.500
  0x50050050c: float[323]= -0.000
  0x500500510: float[324]= -0.500
  0x500500514: float[325]= -0.866
  0x500500518: float[326]= -1.000
  0x50050051c: float[327]= -0.866
  0x500500520: float[328]= -0.500
  0x500500524: float[329]= 0.000
  0x500500528: float[330]= 0.500
  0x50050052c: float[331]= 0.866
  0x500500530: float[332]= 1.000
  GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320)  TotThrds=1    ** Dest of first copy
  0x500100f98: float[320]= 1.000
  0x500100f9c: float[321]= 0.866
  0x500100fa0: float[322]= 0.500
  0x500100fa4: float[323]= -0.000
  0x500100fa8: float[324]= -0.500
  0x500100fac: float[325]= -0.866
  0x500100fb0: float[326]= -1.000
  0x500100fb4: float[327]= -0.866
  0x500100fb8: float[328]= -0.500
  0x500100fbc: float[329]= 0.000
  0x500100fc0: float[330]= 0.500
  0x500100fc4: float[331]= 0.866
  0x500100fc8: float[332]= 1.000
  GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0)  TotThrds=1    ** Source of 2nd copy
  0x500500780: int[0]= 1
  0x500500784: int[1]= 1
  0x500500788: int[2]= 1
  0x50050078c: int[3]= 1
  0x500500790: int[4]= 1
  0x500500794: int[5]= 1
  0x500500798: int[6]= 1
  0x50050079c: int[7]= 1
  0x5005007a0: int[8]= 1
  0x5005007a4: int[9]= 1
  0x5005007a8: int[10]= 1
  0x5005007ac: int[11]= 1
  0x5005007b0: int[12]= 0
  GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0)  TotThrds=1    ** Destination of 2nd copy
  0x500100818: int[0]= 1
  0x50010081c: int[1]= 1
  0x500100820: int[2]= 1
  0x500100824: int[3]= 1
  0x500100828: int[4]= 1
  0x50010082c: int[5]= 1
  0x500100830: int[6]= 1
  0x500100834: int[7]= 1
  0x500100838: int[8]= 1
  0x50010083c: int[9]= 1
  0x500100840: int[10]= 1
  0x500100844: int[11]= 1
  0x500100848: int[12]= 0

When I look at the bad case, it appears as though something has happened to the symbol table. As in, the data of the first copy destination is very familiar. Not like it has been overwritten, just moved. Like the pointer is wrong.

are you doing error checking on your cuda calls? You were given an example [here](http://stackoverflow.com/questions/14968293/copy-symbol-address-to-symbol). — Robert Crovella, Mar 13 '13 at 17:03
Yes, I did not include the macro that does the checking. No errors are being reported. (cudaSuccess) — Doug, Mar 13 '13 at 17:18
The offset applies to the symbol, not the source. That is your problem. — talonmies, Mar 13 '13 at 17:26
Your error checking is broken then. The cuda runtime definitely throws an error for this condition. I just tested it using the error checking example I linked to above. — Robert Crovella, Mar 13 '13 at 17:41
I get "invalid argument" if the COUNT exceeds the destination size. But there is no err if the offset exceeds the destination size. — Doug, Mar 13 '13 at 17:46
The macro compares the Memcpy return value against 'cudaSuccess'. It does not call cudaGetLastError(). Is this what you are seeing/using? — Doug, Mar 13 '13 at 17:54
I've updated my answer with a simple test case that shows that *either* the method of checking the function return value *or* the method using `cudaGetLastError()` indicates an error for this condition. I suggest you examine your error checking methodology. A known good example is [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) as well. — Robert Crovella, Mar 13 '13 at 18:30
There is no error when building w/ the -G flag. Build with "nvcc junk.cu -G -o junk" and cudaMemcpyToSymbol() returns cudaSuccess. Build with "nvcc junk.cu -o junk" and cudaMemcpyToSymbol() returns "invalid argument". — Doug, Mar 13 '13 at 18:30
I get the error in the test code that I posted even when I compile with `nvcc -G -arch=sm_20 -o t94 t94.cu` If you can provide a simple example (about as simple as mine) that shows the behavior you describe, I'd like to see it. — Robert Crovella, Mar 13 '13 at 18:35
See example @ http://stackoverflow.com/questions/15394045/removing-runtime-errors-with-the-g-flag — Doug, Mar 13 '13 at 18:56

Robert Crovella · Accepted Answer · 2013-03-13T18:27:57.423

The second copy looks broken to me. You have defined this array:

__constant__ int   ints[160];  // 640 bytes

which as correctly noted is 640 bytes long.

Your second copy is like this:

cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice);  // second copy

Which says, "copy a total of 640 bytes, from pFlts array to ints array, with the storage location in the ints array beginning at 1920 bytes from the start of the array."

This won't work. The ints array is only 640 bytes long. You can't pick as your destination a location that is 1920 bytes into it.

From the documentation for cudaMemcpyToSymbol :

offset- Offset from start of symbol in bytes

In this case the symbol is ints

Probably what you want is:

cudaMemcpyToSymbol(ints,pFlts+480,640,0,cudaMemcpyDeviceToDevice);  // second copy

EDIT: In response to the questions in the comments about error checking, I crafted this simple test program:

#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__constant__ int ints[160];

int main(){

  int *d_ints;
  cudaError_t mystatus;

  cudaMalloc((void **)&d_ints, sizeof(int)*160);
  cudaCheckErrors("cudamalloc fail");
  mystatus = cudaMemcpyToSymbol(ints, d_ints, 160*sizeof(int), 1920, cudaMemcpyDeviceToDevice);
  if (mystatus != cudaSuccess) printf("returned value was not cudaSuccess\n");
  cudaCheckErrors("cudamemcpytosymbol fail");

  printf("OK!\n");
  return 0;
}

When I compile and run this, I get the following output:

returned value was not cudaSuccess
Fatal error: cudamemcpytosymbol fail (invalid argument at t94.cu:26)
*** FAILED - ABORTING

This indicates that both the error return value from the cudaMemcpyToSymbol function call and the cudaGetLastError() method return an error in this case. If I change the 1920 parameter to zero in this test case, the error goes away.

Yes, I was using the OFFSET as an offset into the source, not the destination. (Bassackwards) Thanks — Doug, Mar 13 '13 at 17:30

cudaMemcpyToSymbol vs cudaMemcpy

1 Answers1

Linked

Related