
I have a class called Product.

Each product has a value, and I want to sum these values on the GPU. I filled my array on the host side:

int * h_A, * d_A;
    h_A = (int*) malloc(enterNum * sizeof(int));
    cudaMalloc((void **) &d_A, enterNum * sizeof(int));



    int i = 0;

    while (i < enterNum) {
        cout << "Enter product name:";
        cin >> desc;
        cout << "Enter product price:";
        cin >> price;

        Product p("Product", price);
        h_A[i] = p.getValue();
        i++;
    }


    cudaMemcpy(d_A, h_A, enterNum, cudaMemcpyHostToDevice);
    priceSum<<<enterNum, 1024>>>(d_A,enterNum,result);

    int  result2 = 0;

    cudaMemcpy(result, result2, enterNum, cudaMemcpyDeviceToHost);

Here the cudaMemcpy function gives an error because I don't pass a pointer. What can I do here? I don't need to use a pointer here, do I?

This is my summation function:

__global__ void priceSum(int *dA, int count, int result) {

    int tid = blockIdx.x;

    if (tid < count){
        result+= dA[tid];
    }
}

Full code:

using namespace std;
#include "cuda_runtime.h"
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <stdlib.h>

class Product {
private:
    char * description;
    int productCode;
    int value;
    static int lastCode;
public:
    Product(char* descriptionP, int valueP) {

        productCode = ++lastCode;
        value = valueP;
        description = new char[strlen(descriptionP) + 1];
        strcpy(description, descriptionP);
    }

    Product(Product& other) {
        productCode = ++lastCode;
        description = new char[strlen(other.description) + 1];
        strcpy(description, other.description);
    }

    ~Product() {
        delete[] description;
    }

    char* getDescription() const {
        return description;
    }

    void setDescription(char* description) {
        this->description = description;
    }

    int getValue() const {
        return value;
    }

    void setValue(int value) {
        this->value = value;
    }

};

int Product::lastCode = 1000;

__global__ void priceSum(int *dA, int count, int * result) {

    int tid = blockIdx.x;

    if (tid < count)
        result+= dA[tid];

}

int main(void) {

    int enterNum, price, * result = 0;
    string desc;
    const char * desc2;


    cout << "How many products do you want to enter?";
    cin >> enterNum;

    int * h_A, * d_A;
    h_A = (int*) malloc(enterNum * sizeof(int));
    cudaMalloc((void **) &d_A, enterNum * sizeof(int));



    int i = 0;

    while (i < enterNum) {

        cout << "Enter product name:";
        cin >> desc;
        cout << "Enter product price:";
        cin >> price;

        Product p("Product", price);
        h_A[i] = p.getValue();
        i++;
    }


    cudaMemcpy(d_A, h_A, enterNum * sizeof(int), cudaMemcpyHostToDevice);
    priceSum<<<enterNum, 1>>>(d_A,enterNum,result);

    int  result2 = 0;

    cudaMemcpy(&result2, result, enterNum, cudaMemcpyDeviceToHost);
    cout << result2;

    return 0;
}
  • As I said on your very first question, if you don't understand basic C++ programming, you cannot hope to be able to write functional CUDA code. This code contains such elementary mistakes that it is clear you need to spend some time learning the basics of the language before worrying about CUDA. This question won't ever be of benefit to anyone else, and I have voted to close it. – talonmies Apr 04 '13 at 05:55

1 Answer


You should show the definition of result in your host code, but I assume it is:

int result;

based on how you are passing it to your priceSum kernel.

You have more than one problem here.

  1. In your priceSum kernel, you are summing the values in dA[] and storing the answer in result. But you have passed the variable result to the kernel by value instead of by reference so the value you are modifying is local to the function, and will not show up anywhere else. When a function in C needs to modify a variable that is passed to it via the parameter list, and the modified variable is to show up in the function calling context, it's necessary to pass that parameter by reference (i.e. using a pointer) rather than by value. Note this is based on the C programming language and is not specific to CUDA. So you should rewrite your kernel definition as:

    __global__ void priceSum(int *dA, int count, int *result) {
    
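    The same pass-by-value pitfall can be demonstrated in plain C++ on the host, independent of CUDA. A minimal sketch (the function names here are made up for illustration):

    ```cpp
    #include <iostream>

    // Pass by value: the function modifies its own local copy only.
    void sumByValue(const int *a, int count, int result) {
        for (int i = 0; i < count; i++)
            result += a[i];   // this update is lost when the function returns
    }

    // Pass by pointer: the function writes through to the caller's variable.
    void sumByPointer(const int *a, int count, int *result) {
        for (int i = 0; i < count; i++)
            *result += a[i];
    }

    int main() {
        int a[3] = {10, 20, 30};
        int r1 = 0, r2 = 0;
        sumByValue(a, 3, r1);
        sumByPointer(a, 3, &r2);
        std::cout << r1 << " " << r2 << std::endl;  // prints "0 60"
        return 0;
    }
    ```

    The caller's `r1` is never touched, which is exactly what happens to `result` in the original kernel.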
  2. Regarding your cudaMemcpy call, there are several issues that need to be cleaned up. First, we need the storage for result to be properly created using cudaMalloc (before the kernel is called, because the kernel will store something there.) Next, we need to fix the parameter list of the cudaMemcpy call itself. So your host code should be rewritten as:

    cudaMemcpy(d_A, h_A, enterNum * sizeof(int), cudaMemcpyHostToDevice);
    int *result;
    cudaMalloc((void **)&result, sizeof(int));
    priceSum<<<enterNum, 1024>>>(d_A,enterNum,result);
    
    int  result2 = 0;
    
    cudaMemcpy(&result2, result, sizeof(int), cudaMemcpyDeviceToHost);
    
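    One detail worth adding (it also comes up in the comments below): because the kernel accumulates into `*result`, the device storage should be zeroed before the launch, for example with cudaMemset, and every CUDA call should have its return status checked. A sketch of that host-side sequence:

    ```cuda
    // Allocate and zero the device accumulator before the kernel runs.
    int *result;
    cudaError_t err = cudaMalloc((void **)&result, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaMemset(result, 0, sizeof(int));  // *result = 0 on the device
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    ```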
  3. There appear to be other problems with your code, around the grouping of data for threads and blocks. But you haven't shown enough of your program for me to make sense of it. So let me point out that your code shows only a single value for result (and result2), yet the way your kernel is written, each thread will add its value of dA[tid] to result. You can't have a bunch of threads all updating a single value in global memory with no control mechanism, and expect to get a sensible result. Problems like this are usually best handled with a classical parallel reduction algorithm, but for the sake of simplicity, to try and get something working, you can use atomics:

    atomicAdd(result, dA[tid]);
    
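    Put together, a corrected kernel along those lines might look like this (a sketch, assuming the one-thread-per-block launch from point 4, so `blockIdx.x` indexes the elements):

    ```cuda
    __global__ void priceSum(int *dA, int count, int *result) {
        int tid = blockIdx.x;            // one thread per block: blockIdx.x is the element index
        if (tid < count)
            atomicAdd(result, dA[tid]);  // serialize concurrent updates to the shared accumulator
    }
    ```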
  4. Sorry, but your kernel just makes no sense at all. You are using blockIdx.x as your tid variable, but let's note that blockIdx.x is a number that is the same for every thread in a particular block. So then going on to have every thread add dA[tid] to result in this fashion just doesn't make sense. I believe it will make more sense if you change your kernel invocation to:

    priceSum<<<enterNum, 1>>>(d_A,enterNum,result);
    
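    An alternative to launching one thread per block is the conventional global-index pattern, where each thread computes its own element index from its block and thread coordinates. A sketch, not taken from the code above:

    ```cuda
    __global__ void priceSum(int *dA, int count, int *result) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (tid < count)
            atomicAdd(result, dA[tid]);
    }

    // Launch with enough 256-thread blocks to cover all count elements:
    // priceSum<<<(enterNum + 255) / 256, 256>>>(d_A, enterNum, result);
    ```

    This lets the GPU run many threads per block, which is generally far more efficient than one thread per block, while the `tid < count` guard handles the last, partially filled block.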
Robert Crovella
    you didn't cudaMalloc `result` like I suggested and your last cudaMemcpy doesn't match mine. You may want to look at my point 2 again. also, you should do [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all cuda calls and kernel calls. You also didn't make the change I suggested in 3. Why are you wanting me to look at your code if you ignore my answer? One more thing that I missed in my answer is that you need to initialize `result` to zero, before the kernel, perhaps with cudaMemset. – Robert Crovella Apr 04 '13 at 07:10