2

I have the following function with inline assembly that works fine on debug mode in 32 bit Visual Studio 2008:

__device__ void add(int* pa, int* pb)
{
  asm(".reg .u32   s<3>;"::);
  asm(".reg .u32   r<14>;"::);

  asm("ld.global.b32    s0, [%0];"::"r"(&pa));      //load addresses of pa, pb
  printf(...);
  asm("ld.global.b32    s1, [%0];"::"r"(&pb));
  printf(...);
  asm("ld.global.b32    r1, [s0+8];"::);
  printf(...);  
  asm("ld.global.b32    r2, [s1+8];"::);
  printf(...);

  ...// perform some operations
}

pa and pb are globally allocated on the device such as

__device__ int pa[3] = {0, 0x927c0000, 0x20000011};  
__device__ int pb[3] = {0, 0xbb900000, 0x2000000b};

However, this code fails on release mode, on line asm("ld.global.b32 r1, [s0+8];"::); How can I load function parameters correctly with inline ptx on release mode?

P.S. building the release mode with -G flag (Generates GPU debug info) causes the code to run correctly on release mode. Thank you,

Meriko
  • 161
  • 2
  • 11
  • Is your release build 32 bit as well? Otherwise the line will obviously fail. If not: It is a lot easier to help with an error message if you tell us what the error message is. – tera Apr 16 '13 at 16:46
  • The entity `pa` may be getting optimized into a register in release mode, and you can't take the address of a register. That's just a guess. It would be helpful if you defined whether the failure you are observing is at compile time or run time. If at compile time, what error exactly are you getting? If at runtime, how have you localized to this line of code? Why do you want to take the address of `pa` anyway? Taking the address of a function parameter seems odd. To modify pa in the calling context, you need to pass a pointer to it, i.e. `**pa` Did you mean `*pa` or `&(pa[0])` or `pa`? – Robert Crovella Apr 16 '13 at 16:51
  • @tera: Release build is 32 bit as well. Release version compiles without any errors. When I run, I get different output. Moreover, I have printf statements after each ld.global.b32 istruction. It fails after asm("ld.global.b32 r1, [s0+8];"::); such that it doesn't print the next printf statement. – Meriko Apr 16 '13 at 20:02
  • As Robert said, taking the address of a function pointer is odd. Why not just move the argument itself: `asm("mov.b32 s0, %0;"::"r"(pa));`? – tera Apr 16 '13 at 21:09
  • @RobertCrovella: The error is at runtime. I am getting totally wrong output; printf and other instructions after asm("ld.global.b32 r1, [s0+8];"::); are not executed, they're being skipped, and then the program exits with the wrong output. – Meriko Apr 16 '13 at 21:10
  • My question is since pa and pb are allocated on the device, is ld.global.b32 the correct way of loading the addresses, values of them? Or should I be using some other option for ld? – Meriko Apr 16 '13 at 21:12

1 Answers1

1

Hopefully this code will help. I'm still guessing at what you are trying to do exactly, but I started with your code and decided to add some values in the pa and pb arrays and store them back into pa[0] and pb[0].

This code is written for a 64 bit machine but converting it to 32 bit pointers should not be difficult. I have marked the lines that need to be changed for 32 bit pointers with a comment. Hopefully this will answer your question about how to use function parameters that are pointers to device memory:

#include <stdio.h>

__device__ int pa[3] = {0, 0x927c0000, 0x20000011};
__device__ int pb[3] = {0, 0xbb900000, 0x2000000b};

__device__ void add(int* mpa, int* mpb)
{
  asm(".reg .u64   s<2>;"::);  // change to .u32 for 32 bit pointers
  asm(".reg .u32   r<6>;"::);

  asm("mov.u64    s0, %0;"::"l"(mpa));      //change to .u32 and "r" for 32 bit
  asm("mov.u64    s1, %0;"::"l"(mpb));      //change to .u32 and "r" for 32 bit
  asm("ld.global.u32    r0, [s0+4];"::);
  asm("ld.global.u32    r1, [s1+4];"::);
  asm("ld.global.u32    r2, [s0+8];"::);
  asm("ld.global.u32    r3, [s1+8];"::);
  asm("add.u32    r4, r0, r2;"::);
  asm("add.u32    r5, r1, r3;"::);
  asm("st.global.u32    [s0], r4;"::);
  asm("st.global.u32   [s1], r5;"::);
}

__global__ void mykernel(){
  printf("pa[0] = %x, pb[0] = %x\n", pa[0], pb[0]);
  add(pa, pb);
  printf("pa[0] = %x, pb[0] = %x\n", pa[0], pb[0]);
}

int  main() {
  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

When I run this code I get:

$ ./t128
pa[0] = 0, pb[0] = 0
pa[0] = b27c0011, pb[0] = db90000b
$

which I believe is correct output.

I compiled it with:

nvcc -O3 -arch=sm_20 -o t128 t128.cu
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257