
I wrote a pretty simple CUDA program. I would like to assign values to a matrix in device memory, then copy the values to the host and display them. The program I wrote does not work, but I don't know why. I tried to figure out what I am doing wrong by printing status messages with cout, but even those do not appear, so I think the main function is not starting at all.

Does anyone know what's the problem?

Here is my code:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <iostream>
    #include <stdio.h>
    const int N = 1024;


    __global__ void matrix(float *d_A)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;


        int index = col + row * N;
        if (col < N && row < N)
        {
            d_A[index] = 255;
        }
    }
    int main()
    {
        std::cout << "Programm begins";
        float A[N * N];
        float d_A[N * N];

        cudaMalloc((void**)&d_A, (N * N)*sizeof(float));
        std::cout << "Matrizes allocated";
        std::cout << A[0] << " , " << A[1] << " , " << A[2] << " , " << A[3] << " , " << A[4] << " , " << A[5] << "\n";
        std::cout << A[1024] << " , " << A[1025] << " , " << A[1026] << " , " << A[1027] << " , " << A[1028] << " , " << A[1029] << "\n";
        matrix << <1024, 1024 >> >(d_A);
        std::cout << "Wrote Matrix to local device memory";
        std::cout << d_A[0] << " , " << d_A[1] << " , " << d_A[2] << " , " << d_A[3] << " , " << d_A[4] << " , " << d_A[5] << "\n";
        std::cout << d_A[1024] << " , " << d_A[1025] << " , " << d_A[1026] << " , " << d_A[1027] << " , " << d_A[1028] << " , " << d_A[1029] << "\n";


        cudaMemcpy(A, d_A, N * N * sizeof(float), cudaMemcpyDeviceToHost);
        std::cout << "Wrote Matrix to host memory";
        std::cout << A[0] << " , " << A[1] << " , " << A[2] << " , " << A[3] << " , " << A[4] << " , " << A[5] << "\n";
        std::cout << A[1024] << " , " << A[1025] << " , " << A[1026] << " , " << A[1027] << " , " << A[1028] << " , " << A[1029] << "\n";

        return 0;
    }
  • See https://stackoverflow.com/questions/1847789/segmentation-fault-on-large-array-sizes. Even after you change that, I see several other problems which will prevent the code from working – talonmies Sep 18 '18 at 10:05
  • As a general comment, float d_A[N * N] allocates host memory in addition to the device memory you allocate with cudaMalloc; you should really declare it as float *d_A (a pointer) and then free the memory at the end of the program. Also, I'd try a much simpler device function as a starter, such as d_A[threadIdx.x] = 1; – Phill Sep 18 '18 at 10:11
  • Thank you, so when I don't use cudaMalloc, is d_A still located on the device? I deleted the cudaMalloc call, but the program behaves the same: it displays no cout output. Do you know what might cause this? – Marcel Rudolf Sep 18 '18 at 10:17
  • You must use cudaMalloc for d_A. And you cannot print its contents on the host – talonmies Sep 18 '18 at 10:23
  • @talonmies: Thank you! You are right, I am getting a stack overflow. Could I solve this problem with textures? – Marcel Rudolf Sep 18 '18 at 10:23
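
A minimal sketch of the point from talonmies above (the kernel name fill and the size of 8 are just illustrative, not from the thread): device memory cannot be dereferenced on the host, so the values have to be copied back with cudaMemcpy before they can be printed.

#include "cuda_runtime.h"
#include <iostream>

__global__ void fill(float *d_data)
{
    // Each of the n threads writes one element.
    d_data[threadIdx.x] = 255.0f;
}

int main()
{
    const int n = 8;                 // tiny size, just for illustration
    float host[n];                   // host buffer to copy into
    float *d_data = nullptr;         // device pointer, no host array behind it

    cudaMalloc((void**)&d_data, n * sizeof(float));
    fill<<<1, n>>>(d_data);

    // d_data cannot be read directly on the host; copy the values back first.
    cudaMemcpy(host, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        std::cout << host[i] << " ";
    std::cout << "\n";

    cudaFree(d_data);
    return 0;
}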

2 Answers


There are a few issues with your code; I'd simplify it even further if these are your first steps into CUDA and C++. Try this out (important changes surrounded by comment stars):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>

const int Grids = 256;
const int Threads = 256;

__global__ void matrix(float *d_A)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_A[idx] = 1.0;
}

int main()
{
    std::cout << "Programm begins";

    // ****
    float *A = new float[Grids * Threads];
    float *d_A;
    // ****

    cudaMalloc((void**)&d_A, (Grids * Threads)*sizeof(float));

    matrix<<<Grids, Threads>>>(d_A);

    cudaMemcpy(A, d_A, Grids * Threads*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i=0; i < (Grids * Threads); ++i)
    {
        std::cout << A[i] << ",";
    }

    // ****
    cudaFree(d_A);
    delete[] A;
    // ****  

    return 0;
}

Also have a look here for basic examples: https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/

A couple of issues here:

1) The host arrays you declared are N * N (1024 * 1024) floats, about 4 MB each, which is far too large for the stack and is what causes the crash (see the link in the comments above)

2) By declaring d_A as an array you also allocated host (stack) memory for it, in addition to the device memory from cudaMalloc, which is not required

3) You didn't free the device memory for d_A

4) Your device/GPU may not be capable of running 1024 threads per block; in that case the launch fails silently and your kernel never runs. You can query the limit at runtime, as in the sketch after this list.
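
For point 4, a quick way to see what your GPU supports is to query its properties at runtime; the minimal sketch below assumes device 0 and prints only the limits that matter here.

#include "cuda_runtime.h"
#include <iostream>

int main()
{
    // Query device 0 to see how many threads per block (and which block dimensions) it supports.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        std::cout << "cudaGetDeviceProperties failed: " << cudaGetErrorString(err) << "\n";
        return 1;
    }

    std::cout << "Device: " << prop.name << "\n";
    std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << "\n";
    std::cout << "Max block dimensions: " << prop.maxThreadsDim[0] << " x "
              << prop.maxThreadsDim[1] << " x " << prop.maxThreadsDim[2] << "\n";
    return 0;
}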

Phill

There are a few problems with the code you have provided.

  1. Dereferencing device memory on the host, e.g. d_A[0], is illegal and results in undefined behavior.
  2. The kernel treats the data as 2-dimensional, whereas the grid and block are provided as 1-dimensional. In this case, the row variable will always be 0 and plays essentially no role in the calculation of index. Define the grid and block sizes as dim3 to create a 2D grid and block.
  3. It is not recommended to create large arrays on the stack, such as float A[N*N];. Prefer dynamic memory allocation with the new operator.
  4. Allocating device memory with cudaMalloc on the already declared host array d_A is undefined behavior. If you want to allocate device memory for a variable, just declare it as a plain pointer, like float* d_A;.

The fixed code may look like this:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>
const int N = 1024;

__global__ void matrix(float *d_A)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;


    int index = col + row * N;
    if (col < N && row < N)
    {
        d_A[index] = 255;
    }
}
int main()
{
    std::cout << "Programm begins"<<std::endl;
    float *A = new float[N*N];
    float *d_A;

    cudaMalloc((void**)&d_A, (N * N)*sizeof(float));
    std::cout << "Matrizes allocated"<<std::endl;
    std::cout << A[0] << " , " << A[1] << " , " << A[2] << " , " << A[3] << " , " << A[4] << " , " << A[5] <<std::endl;
    std::cout << A[1024] << " , " << A[1025] << " , " << A[1026] << " , " << A[1027] << " , " << A[1028] << " , " << A[1029] <<std::endl;

    dim3 block(32,32);
    dim3 grid;
    grid.x = (N + block.x - 1) / block.x;
    grid.y = (N + block.y - 1) / block.y;

    matrix<<<grid, block>>>(d_A);
    std::cout << "Wrote Matrix to local device memory"<<std::endl;

    cudaMemcpy(A, d_A, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "Wrote Matrix to host memory"<<std::endl;
    std::cout << A[0] << " , " << A[1] << " , " << A[2] << " , " << A[3] << " , " << A[4] << " , " << A[5] <<std::endl;
    std::cout << A[1024] << " , " << A[1025] << " , " << A[1026] << " , " << A[1027] << " , " << A[1028] << " , " << A[1029] <<std::endl;

    cudaFree(d_A);
    delete[] A;

    return 0;
}

It is highly recommended to add error checking for each CUDA API call to ease the debugging process.
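
A minimal version of such a check could look like the sketch below; the macro name checkCuda is only illustrative, any equivalent wrapper around the runtime calls will do.

#include "cuda_runtime.h"
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call: print the error string and abort on failure.
#define checkCuda(call)                                                     \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Usage with the code above, for example:
//   checkCuda(cudaMalloc((void**)&d_A, N * N * sizeof(float)));
//   matrix<<<grid, block>>>(d_A);
//   checkCuda(cudaGetLastError());        // reports launch configuration errors
//   checkCuda(cudaDeviceSynchronize());   // reports errors during kernel execution
//   checkCuda(cudaMemcpy(A, d_A, N * N * sizeof(float), cudaMemcpyDeviceToHost));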

sgarizvi
  • Thank you very much. It works now. Thank you also for explaining to me what my mistakes were. This was very helpful as I am a beginner, and as you can see, the simplest problems are a big challenge for me ;-) – Marcel Rudolf Sep 18 '18 at 11:01
  • @MarcelRudolf... You are welcome. Everyone is a beginner at some point. You may consider accepting an answer if your problem is solved. That will help future users if they encounter a similar problem. – sgarizvi Sep 18 '18 at 11:03