CUDA - "Unaligned memory accesses not supported"

Question

Previous questions about this program:

Translating four nested loops into a CUDA kernel

I'm using Visual Studio 2012 and CUDA 6. The code is supposed to add a blur effect to a BMP file using CUDA. Before converting to CUDA, everything worked perfectly. This is my first project with both C and CUDA, so I might have made some silly mistakes. I'm getting 76 errors with my code; most of them are "this declaration has no storage class or type specifier", plus others that don't make any sense. Earlier I tried the Hello World program from http://computer-graphics.se/hello-world-for-cuda.html and it works OK. It showed the same errors, so I'm not really concerned about them.

But I have two errors that are different:

Error    2    error : Unaligned memory accesses not supported  C:\Users\Karpińscy\documents\visual studio 2012\Projects\blur\blur\kernel.cu    blur

And:

Error 3 error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2012 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -G -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\Karpińscy\documents\visual studio 2012\Projects\blur\blur\kernel.cu"" exited with code 2. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations\CUDA 6.0.targets 597 9 blur

I have searched for answers, even on the second page of Google results, and I haven't found a solution that works for me. Please help!

Program code:

#include <stdio.h>
#include <stdlib.h>
#include <Windows.h>


#pragma pack(push,1)
/* Windows 3.x bitmap file header */
typedef struct {
    char         filetype[2];   /* magic - always 'B' 'M' */
    unsigned int filesize;
    short        reserved1;
    short        reserved2;
    unsigned int dataoffset;    /* offset in bytes to actual bitmap data */
} file_header;

/* Windows 3.x bitmap full header, including file header */
typedef struct {
    file_header  fileheader;
    unsigned int headersize;
    int          width;
    int          height;
    short        planes;
    short        bitsperpixel;  /* we only support the value 24 here */
    unsigned int compression;   /* we do not support compression */
    unsigned int bitmapsize;
    int          horizontalres;
    int          verticalres;
    unsigned int numcolors;
    unsigned int importantcolors;
} bitmap_header;
#pragma pack(pop)

__global__ void blur(bitmap_header* hp, unsigned char *data)
{
    int xx,yy,x,y, avgB, avgG, avgR, ile;
    int blurSize = 5;

    xx = blockIdx.y * blockDim.y + threadIdx.y;
    yy = blockIdx.x * blockDim.x + threadIdx.x;

    if(xx >= hp->width || yy >= hp->height)
        return;


    avgB = avgG = avgR = 0;
    ile = 0;

    for(x = xx; x < hp->width && x < xx + blurSize; x++)
    {


        for(y = yy; y < hp->height && y < yy + blurSize; y++)
        {
            avgB += data[x*3 + y*hp->width*3 + 0];
            avgG += data[x*3 + y*hp->width*3 + 1];
            avgR += data[x*3 + y*hp->width*3 + 2];
            ile++;
        }
    }

    avgB = avgB / ile;
    avgG = avgG / ile;
    avgR = avgR / ile;

    data[xx*3 + yy*hp->width*3 + 0] = avgB;
    data[xx*3 + yy*hp->width*3 + 1] = avgG;
    data[xx*3 + yy*hp->width*3 + 2] = avgR;
}

int filter(char* input, char *output)
{
    FILE *fp,*out;
    bitmap_header* hp;
    bitmap_header* d_hp;
    unsigned char *data;
    unsigned char *d_data;

    //Open input file in binary mode (text mode "r" can corrupt BMP data on Windows):
    fp = fopen(input, "rb");
    if(fp==NULL)
        return 1;

    //Read the input file headers:
    hp=(bitmap_header*)malloc(sizeof(bitmap_header));

    cudaMalloc( &d_hp, sizeof(bitmap_header));

    if(hp==NULL)
        return 1;

    fread(hp, sizeof(bitmap_header), 1, fp);

    cudaMemcpy(d_hp, hp, sizeof(bitmap_header), cudaMemcpyHostToDevice);

    //Read the data of the image:
    data = (unsigned char*)malloc(sizeof(char)*hp->bitmapsize);

    cudaMalloc( &d_data, sizeof(char)*hp->bitmapsize);

    fseek(fp,sizeof(char)*hp->fileheader.dataoffset,SEEK_SET);
    fread(data,sizeof(char),hp->bitmapsize, fp);

    cudaMemcpy(d_data, data, sizeof(char)*hp->bitmapsize, cudaMemcpyHostToDevice);

    //Not sure if correctly calling function
    dim3 block(16,16);
    dim3 grid ( (hp->height + 15)/16, (hp->width + 15)/16 );
    blur<<<grid,block>>>(d_hp, d_data);

    cudaMemcpy(data, d_data, sizeof(char)*hp->bitmapsize, cudaMemcpyDeviceToHost);

    //Open output file:
    out = fopen(output, "wb");
    if(out==NULL)
    {
        fclose(fp);
        free(hp);
        free(data);
        cudaFree(d_data);
        cudaFree(d_hp);
        return 1;
    }

    fwrite(hp,sizeof(char),sizeof(bitmap_header),out);

    fseek(out,sizeof(char)*hp->fileheader.dataoffset,SEEK_SET);
    fwrite(data,sizeof(char),hp->bitmapsize,out);

    fclose(fp);
    fclose(out);
    free(hp);
    free(data);

    cudaFree(d_data);
    cudaFree(d_hp);
    return 0;
}

int main(int argc, char* argv[])
{
    char *path = "file.bmp";
    filter(path,path);

    return 0;
}

I have been asked to implement error checking as described in "What is the canonical way to check for errors using the CUDA runtime API?", but I have no idea how to do it or whether it is really going to help me.
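For readers unsure what such error checking looks like, here is a minimal sketch of one common pattern: wrap every runtime call in a macro that reports the failing call, then check separately for launch errors and execution errors after a kernel. The macro name `CUDA_CHECK` and the kernel `noop` are hypothetical names chosen for this illustration; only the `cuda*` functions belong to the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CUDA_CHECK is a hypothetical helper name, not part of the CUDA runtime API.
// It prints the error string and location, then exits, if a call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel used only to demonstrate checking a launch.
__global__ void noop() {}

int main()
{
    unsigned char *d_buf = NULL;
    CUDA_CHECK(cudaMalloc(&d_buf, 256));

    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // reports errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

Applied to the program above, the same wrapper would go around each `cudaMalloc` and `cudaMemcpy`, with the two checks placed right after the `blur<<<grid,block>>>` launch. A kernel that silently never runs would then produce a message instead of an unchanged image.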

EDIT:

I fixed those problems thanks to @DanielKamilKozar. The program compiles, but the blur isn't being applied to the BMP files. Is the blur function called with the proper CUDA syntax?

I have practically no experience or knowledge about CUDA, but the lines `data[x*3 + y*hp->width*3 + 2];` ring a bell when it comes to alignment. You seem to be trying to access memory with byte granularity, and that's apparently unsupported with CUDA. — Daniel Kamil Kozar, May 17 '14 at 21:55
Also, [this](http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/) seems to confirm my suspicions : _The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size._ — Daniel Kamil Kozar, May 17 '14 at 21:57
Plus, you're using packing on your structures, which means that their members are certainly not aligned to a word boundary. This is another thing to check. — Daniel Kamil Kozar, May 17 '14 at 21:59
@DanielKamilKozar Do you have any idea how I can modify this code so the data values would be aligned? I have never touched anything like that before. — Bartosz Karpiński, May 17 '14 at 22:03
Lose the packing of the structs, and replace it with proper (de-)serializing functions. Alternatively, pass the values that the kernel needs directly into it, and not the whole header structure. As for the data accesses, it seems like you'll just have to make requests that are large enough (how large, I don't know, you'd have to look it up) and use bit shifting/masking to obtain the 8-bit values that are needed for actual computation. — Daniel Kamil Kozar, May 17 '14 at 22:12
@DanielKamilKozar I passed the needed information from the header instead of the whole header, didn't change anything about the pixel data, and the program works! EDIT: It almost works... The program compiles, but the blur effect isn't being applied to the BMP files. — Bartosz Karpiński, May 17 '14 at 22:30
@DanielKamilKozar I updated the CUDA software and the program works now. You suggested passing only some of the header's contents instead of the whole header, and that fixed my main problem, so if you want, write an answer and I will mark it as correct. — Bartosz Karpiński, May 17 '14 at 23:52
"This is my first project with C and CUDA" - note that "CUDA C" is actually a dialect of C++, so you may need to modify your code accordingly. — M.M, May 18 '14 at 02:35
The BMP format uses a header layout that does not align data elements to their natural alignment. This will cause problems on the GPU and many other architectures. — Greg Smith, May 18 '14 at 18:49
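The alignment point raised in these comments can be made concrete with a small host-side sketch. It assumes the same packed layout as the question's structs (a 14-byte file header followed by `headersize`, `width`, `height`) and shows that `width` lands at byte offset 18, which is not a multiple of 4. It then reads the two fields byte by byte, which is one way to do the "proper deserializing" suggested above. The names `packed_bitmap_header`, `read_u32_le`, `image_dims`, and `parse_dims` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Same field order as the question's structs, truncated after height.
#pragma pack(push, 1)
struct packed_file_header {
    char          filetype[2];
    std::uint32_t filesize;
    std::int16_t  reserved1;
    std::int16_t  reserved2;
    std::uint32_t dataoffset;
};
struct packed_bitmap_header {
    packed_file_header fileheader;
    std::uint32_t      headersize;
    std::int32_t       width;
    std::int32_t       height;
};
#pragma pack(pop)

// With packing, the 4-byte fields sit at offsets that are not multiples of 4.
static_assert(sizeof(packed_file_header) == 14, "file header is 14 bytes");
static_assert(offsetof(packed_bitmap_header, width) == 18, "width at offset 18");
static_assert(offsetof(packed_bitmap_header, height) == 22, "height at offset 22");

// Assemble a little-endian 32-bit value from four single-byte reads.
// Single-byte reads have no alignment requirement.
static std::uint32_t read_u32_le(const unsigned char *p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

struct image_dims {
    int width;
    int height;
};

// Extract width and height from the raw header bytes on the host, so that
// only two plain ints need to be handed to the kernel.
static image_dims parse_dims(const unsigned char *header_bytes)
{
    image_dims d;
    d.width  = static_cast<int>(read_u32_le(header_bytes + 18));
    d.height = static_cast<int>(read_u32_le(header_bytes + 22));
    return d;
}
```

The per-pixel `unsigned char` accesses in the kernel are single-byte loads and stores, so they are naturally aligned. The multi-byte header fields read through `hp->width` and `hp->height` are the accesses that fall on misaligned addresses in this layout.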

score 1 · Accepted Answer · answered May 19 '14 at 08:53

I solved it by passing only the necessary fields of the BMP header as kernel arguments instead of the full header. I had another problem where the kernel wasn't getting called; I fixed that by updating the CUDA software.

Bartosz Karpiński
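The accepted answer does not show the changed kernel, so the following is only a sketch of what "passing the necessary content" could look like, not the author's actual code. Two things in it are assumptions added here: the kernel writes to a separate output buffer, so threads do not read pixels that other threads have already overwritten, and rows are assumed to have no padding (stride of `width * 3` bytes, as in the question's indexing). The `main` function uses a tiny synthetic image instead of a BMP file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Width and height arrive by value, so the kernel never dereferences a
// packed header struct in device memory.
__global__ void blur(const unsigned char *in, unsigned char *out,
                     int width, int height, int blurSize)
{
    int x0 = blockIdx.x * blockDim.x + threadIdx.x;  // column of this thread's pixel
    int y0 = blockIdx.y * blockDim.y + threadIdx.y;  // row of this thread's pixel
    if (x0 >= width || y0 >= height)
        return;

    int sumB = 0, sumG = 0, sumR = 0, count = 0;
    // Same window as the question: blurSize pixels right and down, clipped.
    for (int y = y0; y < height && y < y0 + blurSize; y++) {
        for (int x = x0; x < width && x < x0 + blurSize; x++) {
            int i = (y * width + x) * 3;
            sumB += in[i + 0];
            sumG += in[i + 1];
            sumR += in[i + 2];
            count++;
        }
    }

    int o = (y0 * width + x0) * 3;
    out[o + 0] = (unsigned char)(sumB / count);
    out[o + 1] = (unsigned char)(sumG / count);
    out[o + 2] = (unsigned char)(sumR / count);
}

int main()
{
    const int width = 8, height = 8;  // synthetic image, no row padding
    const size_t bytes = (size_t)width * height * 3;

    unsigned char *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 10, bytes);      // every channel of every pixel set to 10

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);  // x covers columns, y covers rows
    blur<<<grid, block>>>(d_in, d_out, width, height, 5);

    unsigned char first[3];
    cudaMemcpy(first, d_out, sizeof(first), cudaMemcpyDeviceToHost);
    printf("first output pixel: %d %d %d\n", first[0], first[1], first[2]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Error checking is left out to keep the sketch short. In the real program, `width` and `height` would come from the header read on the host, and each runtime call would be checked.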

score 1 · Answer 2 · answered Feb 26 '15 at 07:16

I was able to solve this problem by changing the arch value from sm_10 to sm_20. My app is running on a GT750M on Windows 8.1 x64 with VS2012.

user2415927
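The arch value mentioned in this answer corresponds to the `-gencode=arch=compute_10,code=\"sm_10,compute_10\"` part of the nvcc command shown in the question. Before changing it, it may help to see which compute capability the installed GPU reports. The sketch below prints that for every visible device using only `cudaGetDeviceCount` and `cudaGetDeviceProperties`. Which arch value suits a given toolkit and GPU is something to confirm against the nvcc documentation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;  // skip devices whose properties cannot be read
        // major.minor is the compute capability, e.g. 2.0 corresponds to sm_20.
        printf("device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```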
