5

2D textures are a useful feature of CUDA in image processing applications. To bind pitch linear memory to 2D textures, the memory has to be aligned. cudaMallocPitch is a good option for aligned memory allocation. On my device, the pitch returned by cudaMallocPitch is a multiple of 512, i.e the memory is 512 byte aligned.

The actual alignment requirement for the device is determined by cudaDeviceProp::texturePitchAlignment which is 32 bytes on my device.

My question is:

If the actual alignment requirement for 2D textures is 32 bytes, then why does cudaMallocPitch return 512 byte aligned memory?

Isn't it a waste of memory? For example if I create an 8 bit image of size 513 x 100, it will occupy 1024 x 100 bytes.

I get this behaviour on following systems:

1: Asus G53JW + Windows 8 x64 + GeForce GTX 460M + CUDA 5 + Core i7 740QM + 4GB RAM

2: Dell Inspiron N5110 + Windows 7 x64 + GeForce GT525M + CUDA 4.2 + Corei7 2630QM + 6GB RAM

sgarizvi
  • 16,623
  • 9
  • 64
  • 98
  • What hardware is this on? I have always found that cudaMallocPitch honors the reported texture alignment. On the only device I have access to right now, the reported alignment in bytes is 256, and I always get multiples of 256 bytes for pitches. – talonmies Sep 23 '12 at 09:37
  • I have updated the question. Added detailed system configurations in the question. – sgarizvi Sep 23 '12 at 09:53

2 Answers2

4

This is a slightly speculative answer, but keep in mind that there are two alignment properties which the pitch of an allocation must satisfy for textures, one for the texture pointer and one for the texture rows. I suspect that cudaMallocPitch is honouring the former, defined by cudaDeviceProp::textureAlignment. For example:

#include <cstdio>

int main(void)
{
    const int ncases = 12;
    const size_t widths[ncases] = { 5, 10, 20, 50, 70, 90, 100,
        200, 500, 700, 900, 1000 };
    const size_t height = 10;

    float *vals[ncases];
    size_t pitches[ncases];

    struct cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    fprintf(stdout, "Texture alignment = %zd bytes\n",
            p.textureAlignment);
    cudaSetDevice(0);
    cudaFree(0); // establish context

    for(int i=0; i<ncases; i++) {
        cudaMallocPitch((void **)&vals[i], &pitches[i], 
            widths[i], height);
        fprintf(stdout, "width = %zd <=> pitch = %zd \n",
                widths[i], pitches[i]);
    }

    return 0;
}

which gives the following on a GT320M:

Texture alignment = 256 bytes
width = 5 <=> pitch = 256 
width = 10 <=> pitch = 256 
width = 20 <=> pitch = 256 
width = 50 <=> pitch = 256 
width = 70 <=> pitch = 256 
width = 90 <=> pitch = 256 
width = 100 <=> pitch = 256 
width = 200 <=> pitch = 256 
width = 500 <=> pitch = 512 
width = 700 <=> pitch = 768 
width = 900 <=> pitch = 1024 
width = 1000 <=> pitch = 1024 

I am guessing that cudaDeviceProp::texturePitchAlignment applies to CUDA arrays.

Mohsen
  • 153
  • 11
talonmies
  • 70,661
  • 34
  • 192
  • 269
  • I guess you are right. On both of my systems, i'm getting `cudaDeviceProp::textureAlignment == 512`. – sgarizvi Sep 23 '12 at 12:02
  • 3
    As talonmies says, there is one requirement for the alignment of the texture (textureAlignment, as I recall 256 byes on older hardware, 512 bytes on current hardware), plus an alignment requirement for each row (texturePitchAlignment). In general, texturePitchAlignment <= textureAlignment. The malloc functions in CUDA return memory suitably aligned for textures. – njuffa Sep 23 '12 at 14:43
3

After doing some experiments with the memory allocation, at last I found a working solution which saves memory. If I forcefully align the memory allocated by cudaMalloc, cudaBindTexture2D works perfectly.

cudaError_t alignedMalloc2D(void** ptr, int width, int height, int* pitch, int alignment = 32)
{       
   if((width% alignment) != 0)
      width+= (alignment - (width % alignment));

   (*pitch) = width;

   return cudaMalloc(ptr,width* height);
}

The memory allocated by this function is 32 byte aligned, which is the requirement of cudaBindTexture2D. My memory usage is now reduced 16 times and all the CUDA functions, which use 2D textures are also working correctly.

Here is a small utility function to get the currently selected CUDA device pitch alignment requirement.

int getCurrentDeviceTexturePitchAlignment()
{
   cudaDeviceProp prop;
   int currentDevice = 0;

   cudaGetDevice(&currentDevice);

   cudaGetDeviceProperties(&prop,currentDevice);

   return prop.texturePitchAlignment;
}
sgarizvi
  • 16,623
  • 9
  • 64
  • 98
  • It works fine as you said, but does it honor the requirement for coalesced access to the memory? – Mohsen Feb 21 '19 at 13:52