2

I wrote a test to illustrate my problem, the code try to copy 16 bytes to an none-4-bytes-aligned memory, but the dest is modified automatically

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__
void Copy128(char *dest,const char *src)
{
    ((int*)dest)[0]=((int*)src)[0];
    ((int*)dest)[1]=((int*)src)[1];
    ((int*)dest)[2]=((int*)src)[2];
    ((int*)dest)[3]=((int*)src)[3];
}
__global__
void fill_src(char *src)
{
    for(int i=0; i<16; i++)
        src[i] = i+1; // starts from 1
}

int main()
{
    char* dest;
    cudaMalloc(&dest, 17);

    char* src;
    cudaMalloc(&src, 16);

    fill_src<<<1, 1>>>((char*)src); // fill some value for debugging

    // copy to dest+1 which is not aligned to 4
    Copy128<<<1, 1>>>(dest + 1, src);

    getchar();
}

Debugging the code in VS2013 as in the image, the target memory is 0x40A80001, but actually it copies to 0x40A80000. enter image description here

The problem is the dest would be modified automatically if it's not aligned to 4-byte. And it's modified silently, I spent hours to find this bug.

I know it's best to use well aligned memory, but I'm writing some rar decompression program, decompress some bytes then concat some bytes, it cannot be always aligned.

I guess I would use uint64 in function like Copy256. Is this normal behavior that memory is force aligned? Any compiling flags that can switch off this functionality? Or should I copy bytes one by one?

Environment: CUDA 6.5, Win7-32bit, VS2013

Community
  • 1
  • 1
aj3423
  • 2,003
  • 3
  • 32
  • 70
  • 5
    When I run your example code, I get an illegal write error within the Copy128 kernel because of unaligned memory access, which is exactly what should happen. I don't understand what point you are trying to make here – talonmies Apr 19 '16 at 11:48
  • 4
    Other than on x86 CPUs, all memory accesses on a GPU must be naturally aligned, that is, aligned to the size of the access, e.g. 4-byte access must be aligned to a 4-byte boundary. So on GPUs, this alignment for memory accesses is necessary for *functional correctness*, not just performance as on x86. This is mentioned in CUDA documentation. For misaligned copies you don't need to copy larger objects entirely byte-by-byte, just use the narrow accesses for the end cases and use wide copies for the bulk of the transfer. – njuffa Apr 19 '16 at 13:37

1 Answers1

4

- Is this normal behavior that memory is force aligned? Yes: Quoted from here, "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes".

Any compiling flags that can switch off this functionality? I guess not, this is probably hardware related

Or should I copy bytes one by one? If you deal with (very) unaligned memory, it's your only option to avoid misaligned stores (as commented above). However, you should try to detect (either at compile time or at runtime) when your memory operations are aligned, and then use the widest load/store you have at hand (int4 leads to ldg instructions, which will give you a way better bandwidth)

Regis Portalez
  • 4,675
  • 1
  • 29
  • 41