
Something that has been confusing me for a while is the alignment requirement of CUDA memory allocations. I know that if the allocations are properly aligned, accessing row elements will be much more efficient.

First a little background:

According to CUDA C Programming Guide (section 5.3.2):

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

My understanding is that for a 2D interleaved array of type T (say, pixel values in R,G,B order), if numChannels * sizeof(T) is either 4, 8, or 16, then the array has to be allocated using cudaMallocPitch if performance is a necessity. So far this has been working fine for me: I check numChannels * sizeof(T) before allocating a 2D array, and if it is 4, 8, or 16, I allocate the array using cudaMallocPitch and everything works.
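In code, that check looks roughly like the following sketch (allocate2D is just a hypothetical helper name of mine, not an API function):

#include <cuda_runtime.h>

// Simplified sketch of the allocation logic described above.
// allocate2D is a hypothetical helper, not part of any API.
template <typename T>
cudaError_t allocate2D(T** devPtr, size_t* pitch, int width, int height, int numChannels)
{
    size_t elementSize = numChannels * sizeof(T);
    if (elementSize == 4 || elementSize == 8 || elementSize == 16)
    {
        // Pitched allocation: each row is padded so that row start
        // addresses satisfy the alignment requirements.
        return cudaMallocPitch((void**)devPtr, pitch, width * elementSize, height);
    }
    // Otherwise fall back to a plain linear allocation.
    *pitch = width * elementSize;
    return cudaMalloc((void**)devPtr, *pitch * height);
}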

Now the question:

I've realized that when using NVIDIA's NPP library, there is a family of allocator functions (nppiMalloc..., like nppiMalloc_32f_C1 and so on). NVIDIA recommends using these functions for performance. My question is: how do these functions guarantee alignment? More specifically, what kind of math do they use to come up with a suitable value for pitch?

For a single-channel 512x512-pixel image (with float pixel values in the range [0, 1]) I've used both cudaMallocPitch and nppiMalloc_32f_C1.
cudaMallocPitch gave me a pitch value of 2048 while nppiMalloc_32f_C1 gave me 2560. Where does the latter number come from, and how exactly is it calculated?
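For reference, the comparison boils down to this minimal snippet:

#include <stdio.h>
#include <cuda_runtime.h>
#include <npp.h>

int main(void)
{
    // 512x512 single-channel float image, allocated both ways.
    void *p1;
    size_t pitchCuda;
    cudaMallocPitch(&p1, &pitchCuda, 512 * sizeof(float), 512);

    int pitchNpp;
    Npp32f *p2 = nppiMalloc_32f_C1(512, 512, &pitchNpp);

    // On my machine this prints 2048 and 2560, respectively.
    printf("cudaMallocPitch: %zu, nppiMalloc_32f_C1: %d\n", pitchCuda, pitchNpp);

    cudaFree(p1);
    nppiFree(p2);
    return 0;
}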

Why I care about this
I'm writing a synced memory class template for synchronizing values between the GPU and the CPU. This class is supposed to take care of allocating pitched memory (when possible) under the hood. Since I want this class to be interoperable with NVIDIA's NPP, I'd like to handle all allocations in a way that provides good performance for CUDA kernels as well as NPP operations.
My impression was that nppiMalloc was calling cudaMallocPitch under the hood, but it seems that I'm wrong.

Maghoumi

2 Answers


An interesting question. However, there may be no definitive answer at all, for several reasons: The implementation of these methods is not publicly available, and one has to assume that NVIDIA uses some special tricks and tweaks internally. Moreover, the resulting pitch is not specified, so one has to assume that it might change between releases of CUDA/NPP. In particular, it's not unlikely that the actual pitch will depend on the hardware version (the "Compute Capability") of the device that the method is executed on.

Nevertheless, I was curious about this and wrote the following test:

#include <stdio.h>
#include <npp.h>

// Allocates images of height 1 and increasing width using the given
// allocator, and prints the ranges of widths that share one pitch value.
template <typename T>
void testStepBytes(const char* name, int elementSize, int numComponents, 
    T (*allocator)(int, int, int*))
{
    printf("%s\n", name);
    int dw = 1;
    int prevStepBytes = 0;
    for (int w=1; w<2050; w+=dw)
    {
        int stepBytes;
        void *p = allocator(w, 1, &stepBytes);
        nppiFree(p);
        if (stepBytes != prevStepBytes)
        {
            printf("Stride %5d is used up to w=%5d (%6d bytes)\n", 
                prevStepBytes, (w-dw), (w-dw)*elementSize*numComponents);
            prevStepBytes = stepBytes;
        }
    }
}

int main(int argc, char *argv[])
{
    testStepBytes("nppiMalloc_8u_C1", 1, 1, &nppiMalloc_8u_C1);
    testStepBytes("nppiMalloc_8u_C2", 1, 2, &nppiMalloc_8u_C2);
    testStepBytes("nppiMalloc_8u_C3", 1, 3, &nppiMalloc_8u_C3);
    testStepBytes("nppiMalloc_8u_C4", 1, 4, &nppiMalloc_8u_C4);

    testStepBytes("nppiMalloc_16u_C1", 2, 1, &nppiMalloc_16u_C1);
    testStepBytes("nppiMalloc_16u_C2", 2, 2, &nppiMalloc_16u_C2);
    testStepBytes("nppiMalloc_16u_C3", 2, 3, &nppiMalloc_16u_C3);
    testStepBytes("nppiMalloc_16u_C4", 2, 4, &nppiMalloc_16u_C4);

    testStepBytes("nppiMalloc_32f_C1", 4, 1, &nppiMalloc_32f_C1);
    testStepBytes("nppiMalloc_32f_C2", 4, 2, &nppiMalloc_32f_C2);
    testStepBytes("nppiMalloc_32f_C3", 4, 3, &nppiMalloc_32f_C3);
    testStepBytes("nppiMalloc_32f_C4", 4, 4, &nppiMalloc_32f_C4);

    return 0;
}

The pitch (stepBytes) seemed to solely depend on the width of the image. So this program allocates memory for images of different types, with an increasing width, and prints information about the maximum image sizes that result in a particular stride. The intention was to derive a pattern or a rule - namely the "kind of math" that you asked about.

The results are ... somewhat confusing. For example, for the nppiMalloc_32f_C1 call, on my machine (CUDA 6.5, GeForce GTX 560 Ti, Compute Capability 2.1), it prints:

nppiMalloc_32f_C1
Stride     0 is used up to w=    0 (     0 bytes)
Stride   512 is used up to w=  120 (   480 bytes)
Stride  1024 is used up to w=  248 (   992 bytes)
Stride  1536 is used up to w=  384 (  1536 bytes)
Stride  2048 is used up to w=  504 (  2016 bytes)
Stride  2560 is used up to w=  640 (  2560 bytes)
Stride  3072 is used up to w=  768 (  3072 bytes)
Stride  3584 is used up to w=  896 (  3584 bytes)
Stride  4096 is used up to w= 1016 (  4064 bytes)
Stride  4608 is used up to w= 1152 (  4608 bytes)
Stride  5120 is used up to w= 1280 (  5120 bytes)
Stride  5632 is used up to w= 1408 (  5632 bytes)
Stride  6144 is used up to w= 1536 (  6144 bytes)
Stride  6656 is used up to w= 1664 (  6656 bytes)
Stride  7168 is used up to w= 1792 (  7168 bytes)
Stride  7680 is used up to w= 1920 (  7680 bytes)
Stride  8192 is used up to w= 2040 (  8160 bytes)

confirming that for an image with width=512, it will use a stride of 2560. The expected stride of 2048 would be used for an image up to width=504.

The numbers seem a bit odd, so I ran another test for nppiMalloc_8u_C1 in order to cover all possible image line sizes (in bytes), with larger image sizes, and noticed a strange pattern: The first increase of the pitch size (from 512 to 1024) occurred when the image was larger than 480 bytes, and 480=512-32. The next step (from 1024 to 1536) occurred when the image was larger than 992 bytes, and 992=480+512. The next step (from 1536 to 2048) occurred when the image was larger than 1536 bytes, and 1536=992+512+32. From there, it seemed to mostly run in steps of 512, except for several sizes in between. The further steps are summarized here:

nppiMalloc_8u_C1
Stride      0 is used up to w=     0 (     0 bytes, delta     0)
Stride    512 is used up to w=   480 (   480 bytes, delta   480)
Stride   1024 is used up to w=   992 (   992 bytes, delta   512)
Stride   1536 is used up to w=  1536 (  1536 bytes, delta   544)
Stride   2048 is used up to w=  2016 (  2016 bytes, delta   480) \
Stride   2560 is used up to w=  2560 (  2560 bytes, delta   544) | 4
Stride   3072 is used up to w=  3072 (  3072 bytes, delta   512) |
Stride   3584 is used up to w=  3584 (  3584 bytes, delta   512) /
Stride   4096 is used up to w=  4064 (  4064 bytes, delta   480)     \
Stride   4608 is used up to w=  4608 (  4608 bytes, delta   544)     |
Stride   5120 is used up to w=  5120 (  5120 bytes, delta   512)     |
Stride   5632 is used up to w=  5632 (  5632 bytes, delta   512)     | 8
Stride   6144 is used up to w=  6144 (  6144 bytes, delta   512)     |
Stride   6656 is used up to w=  6656 (  6656 bytes, delta   512)     |
Stride   7168 is used up to w=  7168 (  7168 bytes, delta   512)     |
Stride   7680 is used up to w=  7680 (  7680 bytes, delta   512)     /
Stride   8192 is used up to w=  8160 (  8160 bytes, delta   480) \
Stride   8704 is used up to w=  8704 (  8704 bytes, delta   544) |
Stride   9216 is used up to w=  9216 (  9216 bytes, delta   512) |
Stride   9728 is used up to w=  9728 (  9728 bytes, delta   512) |
Stride  10240 is used up to w= 10240 ( 10240 bytes, delta   512) |
Stride  10752 is used up to w= 10752 ( 10752 bytes, delta   512) |
Stride  11264 is used up to w= 11264 ( 11264 bytes, delta   512) |
Stride  11776 is used up to w= 11776 ( 11776 bytes, delta   512) | 16
Stride  12288 is used up to w= 12288 ( 12288 bytes, delta   512) |
Stride  12800 is used up to w= 12800 ( 12800 bytes, delta   512) |
Stride  13312 is used up to w= 13312 ( 13312 bytes, delta   512) |
Stride  13824 is used up to w= 13824 ( 13824 bytes, delta   512) |
Stride  14336 is used up to w= 14336 ( 14336 bytes, delta   512) |
Stride  14848 is used up to w= 14848 ( 14848 bytes, delta   512) |
Stride  15360 is used up to w= 15360 ( 15360 bytes, delta   512) |
Stride  15872 is used up to w= 15872 ( 15872 bytes, delta   512) /
Stride  16384 is used up to w= 16352 ( 16352 bytes, delta   480)     \
Stride  16896 is used up to w= 16896 ( 16896 bytes, delta   544)     |
Stride  17408 is used up to w= 17408 ( 17408 bytes, delta   512)     |
...                                                                ... 32
Stride  31232 is used up to w= 31232 ( 31232 bytes, delta   512)     |
Stride  31744 is used up to w= 31744 ( 31744 bytes, delta   512)     |
Stride  32256 is used up to w= 32256 ( 32256 bytes, delta   512)     /
Stride  32768 is used up to w= 32736 ( 32736 bytes, delta   480) \
Stride  33280 is used up to w= 33280 ( 33280 bytes, delta   544) |
Stride  33792 is used up to w= 33792 ( 33792 bytes, delta   512) |
Stride  34304 is used up to w= 34304 ( 34304 bytes, delta   512) |
...                                                            ... 64
Stride  64512 is used up to w= 64512 ( 64512 bytes, delta   512) |
Stride  65024 is used up to w= 65024 ( 65024 bytes, delta   512) /
Stride  65536 is used up to w= 65504 ( 65504 bytes, delta   480)     \
Stride  66048 is used up to w= 66048 ( 66048 bytes, delta   544)     |   
Stride  66560 is used up to w= 66560 ( 66560 bytes, delta   512)     |
Stride  67072 is used up to w= 67072 ( 67072 bytes, delta   512)     |
....                                                               ... 128
Stride 130048 is used up to w=130048 (130048 bytes, delta   512)     |
Stride 130560 is used up to w=130560 (130560 bytes, delta   512)     /
Stride 131072 is used up to w=131040 (131040 bytes, delta   480) \
Stride 131584 is used up to w=131584 (131584 bytes, delta   544) |
Stride 132096 is used up to w=132096 (132096 bytes, delta   512) |
...                                                              | guess...

There obviously is a pattern. The pitches are related to multiples of 512. For sizes of 512*2^n, with n being a whole number, there are some odd -32 and +32 offsets for the size limits that cause a larger pitch to be used.

Maybe I'll have another look at this. I'm pretty sure that one could derive a formula covering this odd progression of the pitch. But again: this may depend on the underlying CUDA version, the NPP version, or even the Compute Capability of the card that is used.
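Just as an experiment, here is one guess at such a formula. It reproduces every value in the tables above on my setup (CUDA 6.5, Compute Capability 2.1), but it is pure reverse engineering from the measured numbers, not a documented rule, so do not rely on it:

#include <stdio.h>

// Guessed pitch rule, derived purely from the measurements above:
// round the line size up to the next multiple of 512; if that multiple
// happens to be a power of two (512, 1024, 2048, 4096, ...) and the
// line size is within 32 bytes of it, skip ahead to the next multiple
// of 512. Speculative, and valid (at best) only for lineSizeBytes >= 1.
static int guessedPitch(int lineSizeBytes)
{
    int pitch = ((lineSizeBytes + 511) / 512) * 512;
    if ((pitch & (pitch - 1)) == 0 && lineSizeBytes > pitch - 32)
    {
        pitch += 512;
    }
    return pitch;
}

int main(void)
{
    // Spot checks against the measured values:
    printf("%d\n", guessedPitch(512 * 4)); // 2560, as for nppiMalloc_32f_C1 with w=512
    printf("%d\n", guessedPitch(504 * 4)); // 2048, the last width with that stride
    printf("%d\n", guessedPitch(992));     // 1024, as in the nppiMalloc_8u_C1 table
    printf("%d\n", guessedPitch(993));     // 1536
    return 0;
}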

And, just for completeness: It might also be the case that this strange pitch size simply is a bug in NPP. You never know.

Marco13
  • Well if it isn't Mr. JCuda himself! (SirM2X on your forum here! :D) Thanks for the experiments and your (as usual) precise elaboration. It really helped a lot and gave wonderful insight – Maghoumi Nov 07 '14 at 21:03
  • pitch is *never* supposed to be something that you assign manually. It is *always* supposed to be returned by an appropriate API call that does a pitched allocation, and you would use that pitch in future calls when accessing that allocation. This sort of reverse engineering, while perhaps interesting, should not be used to replace the proper method/utilization of pitch. By "reverse engineering" I simply mean attempting to deduce an underlying formula which will then be used in place of the proper method. – Robert Crovella Nov 08 '14 at 04:15
  • @(Sir)M2X: Thought so ;-) @RobertCrovella Sure, that's what I tried to emphasize when I said (twice) that the pitch may depend on internal details of the black box. I think there *could* be cases where the rules for computing the pitch are specified. For example, the `nppiMalloc` calls *could* specify "The pitch will be `w*elementsSize*componentSize` rounded to the next multiple of the number of cores per SMP", or so. It *could* also be offered e.g. by the `cudaDeviceProps`. Then one *could* query this information manually, and compute the pitch manually. But here, this is not the case... – Marco13 Nov 08 '14 at 12:53
  • @RobertCrovella Actually I am not setting any pitch manually. The only reason I care about this is that I'd like to check for the "opportunity" of allocating pitched memory (hence calling cudaMallocPitch instead of cudaMalloc). If I know when I'd be using either, then I don't need to worry about anything else. If I know that for a given width NPP would allocate pitched memory, then I'd allocate pitched memory myself using cudaMallocPitch. – Maghoumi Nov 09 '14 at 01:13
  • @RobertCrovella Then again, it seems that cudaMallocPitch and nppiMalloc use different pitch formulas, and I can't guarantee performance with NPP if my allocation was done using cudaMallocPitch. – Maghoumi Nov 09 '14 at 01:14

I thought I would contribute listings of several other allocation types. I am using a GTX 860M with CUDA 7.5.

cudaMallocPitch aligns to the textureAlignment property, rather than texturePitchAlignment as I had suspected. The nppiMalloc functions also align to textureAlignment boundaries, but sometimes over-allocate and make the jump to the next 512 bytes early.

Since all of these functions align each row to textureAlignment rather than the smaller texturePitchAlignment, more space is used, but textures should be able to be bound to any start line without having to use a byte offset in the address calculation. The documentation on textures can be unclear, but it works out that they require a line pitch that is a multiple of 32 (on this generation of hardware; the texturePitchAlignment property), while the start address must be a multiple of 128, 256, or 512 depending on the hardware and CUDA version (textureAlignment). Textures may be bindable at a smaller multiple; in my own experience, before finding the correct property, 256-byte alignment seemed to work fine.

512-byte alignment is fairly large, but there may be performance gains for both textures and non-textures over using the texturePitchAlignment value; I haven't done any testing. For future-proofing, I would suggest using cudaMallocPitch or nppiMalloc, but if memory space is tight you can allocate manually with texturePitchAlignment when using textures (a minimal sketch follows at the end of this answer). Memory bandwidth across the PCI bus should be unchanged by a larger pitch, provided you are using cudaMemcpy2D or a similar function. I would recommend using the NVIDIA functions for copying pitched memory across the PCI bus: if they aren't already highly optimized and using DMA controllers, they'll implement that eventually. For smaller pitches, it may be more memory-efficient to just copy the padding across the PCI bus in a bulk transfer, but that would require testing, and potentially de-padding on the CPU on the other side. I wonder whether the NVIDIA functions de-pad on the GPU before transferring, or do a line-by-line DMA transfer. Maybe eventually, if they don't already.

#include <stdio.h>
#include <cuda_runtime.h>
#include <npp.h>

int main(int argc, char **argv)
{
    void *dmem;
    int pitch, pitchOld = 0;
    size_t pitch2;
    int iOld = 0;
    int maxAllocation = 5000;

    cudaDeviceProp prop;

    cudaGetDeviceProperties(&prop, 0);      

    printf("%s%d%s%d%s", "textureAlignment ", prop.textureAlignment, " texturePitchAlignment ", prop.texturePitchAlignment, "\n");

    printf("%s", "cudaMallocPitch\n");

    for (int i=0;i<maxAllocation;++i) {
        cudaMallocPitch(&dmem, &pitch2, i, 1);

        if (pitch2 != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch2;
            iOld = i;
        }

        cudaFree(dmem);
    }
    pitchOld = 0;

    printf("%s", "nppiMalloc_8u_C1\n");

    for (int i=0;i<maxAllocation/sizeof(Npp8u);++i) {
        dmem = nppiMalloc_8u_C1(i, 1, &pitch);

        if (pitch != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch;
            iOld = i;
        }

        nppiFree(dmem);
    }
    pitchOld = 0;

    printf("%s", "nppiMalloc_32f_C1\n");

    for (int i=0;i<maxAllocation/sizeof(Npp32f);++i) {
        dmem = nppiMalloc_32f_C1(i, 1, &pitch);

        if (pitch != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch;
            iOld = i;
        }

        nppiFree(dmem);
    }
    pitchOld = 0;

    return 0;
}

And the output

textureAlignment 512 texturePitchAlignment 32
cudaMallocPitch
width 0to0 -> pitch 0
width 1to512 -> pitch 512
width 513to1024 -> pitch 1024
width 1025to1536 -> pitch 1536
width 1537to2048 -> pitch 2048
width 2049to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4096 -> pitch 4096
width 4097to4608 -> pitch 4608
nppiMalloc_8u_C1
width 0to0 -> pitch 0
width 1to480 -> pitch 512
width 481to992 -> pitch 1024
width 993to1536 -> pitch 1536
width 1537to2016 -> pitch 2048
width 2017to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4064 -> pitch 4096
width 4065to4608 -> pitch 4608
nppiMalloc_32f_C1
width 0to0 -> pitch 0
width 1to120 -> pitch 512
width 121to248 -> pitch 1024
width 249to384 -> pitch 1536
width 385to504 -> pitch 2048
width 505to640 -> pitch 2560
width 641to768 -> pitch 3072
width 769to896 -> pitch 3584
width 897to1016 -> pitch 4096
width 1017to1152 -> pitch 4608
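
For completeness, a minimal sketch of the manual texturePitchAlignment-based allocation mentioned above could look like this (roundUp is just an illustrative helper; I haven't benchmarked this approach):

#include <stdio.h>
#include <cuda_runtime.h>

// Round x up to the next multiple of 'alignment' (alignment > 0 assumed).
static size_t roundUp(size_t x, size_t alignment)
{
    return ((x + alignment - 1) / alignment) * alignment;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int widthBytes = 1000, height = 512;
    size_t pitch = roundUp(widthBytes, prop.texturePitchAlignment);

    // One linear allocation with a manually chosen pitch. In my
    // experience the base address returned by cudaMalloc is already
    // aligned to textureAlignment, but strictly speaking that should
    // be verified before binding a texture to it.
    void *dmem;
    cudaMalloc(&dmem, pitch * height);

    printf("manual pitch %zu for width %d bytes\n", pitch, widthBytes);

    cudaFree(dmem);
    return 0;
}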
Jonathan Olson