
Hey there, I have a multidimensional mathematical function, meaning that I pass an index to the C++ function to select which single mathematical sub-function should be evaluated. E.g. let's say I have a mathematical function like this:

f = Vector(x^2*y^2 / y^2 / y^2*z^2)

I would implement it like that:

double myFunc(int function_index)
{
    switch(function_index)
    {
        case 1:
            return PNT[0]*PNT[0]*PNT[1]*PNT[1]; // x^2 * y^2
        case 2:
            return PNT[1]*PNT[1];               // y^2
        case 3:
            return PNT[2]*PNT[2]*PNT[1]*PNT[1]; // y^2 * z^2
        default:
            return 0.0; // avoid undefined behavior for unknown indices
    }
}

where PNT is defined globally as double PNT[ NUM_COORDINATES ]. Now I want to implement the derivatives of each function with respect to each coordinate, thus generating the derivative matrix (columns = coordinates; rows = individual functions). I have already written my kernel, which works so far and which calls myFunc().

The problem is: to calculate the derivative of mathematical sub-function i with respect to coordinate j, in sequential mode (on a CPU, for example) I would use the following code (simplified, because usually you would decrease h until you reach a certain precision of your derivative):

f0 = myFunc(i);
PNT[ j ] += h;
derivative = (myFunc(i)-f0)/h;
PNT[ j ] -= h;

Now that I want to do this on the GPU in parallel, a problem comes up: what to do with PNT? Since I have to increase a certain coordinate by h, evaluate the function, and then decrease it again, how do I do that without 'disturbing' the other threads? I can't modify PNT, because the other threads need the 'original' point to modify their own coordinates.

My second idea was to store one modified point per thread, but I discarded it quite fast: with several thousand threads running in parallel, this would be slow and probably not even realizable at all because of memory limits.

'FINAL' SOLUTION: What I do currently is the following, which adds the value 'add' at runtime (without storing it anywhere) to the coordinate identified by coordinate_index, via a preprocessor macro.

#define X(n) ((coordinate_index == (n)) ? (PNT[(n)] + add) : PNT[(n)])

__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    // Example: f[i] = x[i]^3
    return X(function_index) * X(function_index) * X(function_index);
}

That works quite nicely and fast: with a derivative matrix of 10000 functions and 10000 coordinates it takes only about 0.5 seconds. PNT is defined either globally or in constant memory as __constant__ double PNT[ NUM_COORDINATES ];, depending on the preprocessor variable USE_CONST. The line return (X(function_index)*X(function_index)*X(function_index)); is just an example; every sub-function follows the same scheme, mathematically speaking:

f = Vector(x0^3 / x1^3 / ... / xN^3)

NOW THE BIG PROBLEM ARISES:

myFunc is a mathematical function which the user should be able to implement however he likes. E.g. he could also implement the following mathematical function:

f = Vector(x0^2*x1^2*...*xN^2 / x0^2*x1^2*...*xN^2 / ... / x0^2*x1^2*...*xN^2)

so that every component looks the same. As the programmer, you should only have to code once, independently of the implemented mathematical function. When the above function is implemented in C++, it looks like the following:

__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    // Every component is the same product of all squared coordinates.
    double ret = 1.0;
    for(int i = 0; i < NUM_COORDINATES; i++)
        ret *= X(i)*X(i);
    return ret;
}

And now the memory accesses are very 'weird' and bad for performance, because each thread needs to access each element of PNT twice. Surely, in a case like this where every component looks the same, I could rewrite the complete algorithm surrounding the calls to myFunc, but as I stated already: I don't want my code to depend on the user-implemented function myFunc...

Could anybody come up with an idea how to solve this problem? Thanks!

asked by tim (edited by casperOne)
    Surely the first code snippet is not how you would *actually* define a functor for use as a part of a vectorization scheme? – talonmies May 19 '11 at 11:00
  • Sorry I don't really get what you mean?! And how would that be related to the final problem I'm asking for a solution to? – tim May 19 '11 at 11:06
  • I mean that the functor makes no sense on so many levels. And, because it is the basic premise upon which everything that follows is built, and because all functions are compiled inline in CUDA, it has an enormous bearing on how the final kernel code will perform. It also probably means that what you might be worried about from an optimization point of view is irrelevant. Two suggestions for you - template the functor and pass the operation selector as a template argument, and use function pointers. – talonmies May 19 '11 at 11:22
  • Still not really getting it :(. I just don't see how this could be relevant for me!? I thought that the main problem was the memory access from every sub-function to the 'mathematical point' stored in `PNT`. – tim May 19 '11 at 11:26
  • Perhaps it would be useful to start from how the user supplies a function to work with your software. Evidently the user-defined function needs to be compiled in a way that allows it to reference `PNT` in the global address space. Moreover you hint that this shared data might be mutable in some cases (if not each thread could grab its own copy on startup). BTW it is well-known that for numerical differentiation, decreasing the step parameter h beyond a certain value leads to loss of accuracy through rounding errors and truncation. – hardmath May 19 '11 at 12:41
  • I cannot understand what "user supplied functions" means in this case. CUDA doesn't support external linkage, so any function which would be used in this design must be compiled along with all the other device code within the same translation unit. The very best that can be done is a template based library (thrust is an excellent example of this). But this design seems to be the complete antithesis of a generic template library. Perhaps I am missing something obvious, but this seems completely unworkable on many levels. – talonmies May 19 '11 at 16:17
  • I just cannot understand the problem with my explanation. The user-defined function is some sort of mathematical function which he has to write into the .cu file and then compile to run it. That's all there is to do. And as he can write whichever mathematical function he likes, there's a big problem, because I need to evaluate it to calculate the derivative. And when it's something like the last code snippet I posted (`x0^2*x1^2*...*xN^2`), it's a bad function to differentiate regarding memory accesses to the evaluation point `PNT`, which I have no clue how to optimize. – tim May 19 '11 at 18:55
  • And I think there's probably no way to optimize, since to evaluate a function it just HAS to use the coordinates from the point `PNT`. I'm really thinking about offering some of my reputation, but I just think that nobody can come up with a solution to this :( – tim May 19 '11 at 18:56
  • There are solutions to the general mathematical problem you are asking about, but your question is both poorly constructed and, worse still, very prescriptive - you are asking for a solution in the context of the existing code structure you have. Unfortunately that code structure is bordering on the nonsensical. That is why it is a very hard question to answer, not because of an underlying technical reason why the eventual outcome you want cannot be achieved. – talonmies May 20 '11 at 08:35
  • Okay, then please give me an example of code with another, new structure which would help me solve this problem. I really would completely rewrite the code if there's a solution to this, but I still can't imagine how, because to calculate the derivative I definitely have to read the coordinates from `PNT`, and that is, in my opinion, the bottleneck on GPUs. But okay, I would be glad if you would come up with a productive comment with a code example. I know you already posted some 'keywords' on how to do it, but as I can't imagine how to use them to solve it, that hasn't helped me so far. – tim May 20 '11 at 08:57

1 Answer

Rewinding back to the beginning and starting with a clean sheet, it seems you want to be able to do two things

  • compute an arbitrary scalar valued function over an input array
  • approximate the partial derivative of an arbitrary scalar valued function over the input array using first order accurate finite differencing

While the function is scalar valued and arbitrary, it seems that there are, in fact, two clear forms which this function can take:

  1. A scalar valued function with scalar arguments
  2. A scalar valued function with vector arguments

You appear to have started with the first type of function, have put together code to compute both the function and the approximate derivative, and are now wrestling with the problem of how to deal with the second case using the same code.

If this is a reasonable summary of the problem, then please indicate so in a comment and I will continue to expand it with some code samples and concepts. If it isn't, I will delete it in a few days.

In comments, I have been trying to suggest that conflating the first type of function with the second is not a good approach. The requirements for correctness in parallel execution, and the best way of extracting parallelism and performance on the GPU are very different. You would be better served by treating both types of functions separately in two different code frameworks with different usage models. When a given mathematical expression needs to be implemented, the "user" should make a basic classification as to whether that expression is like the model of the first type of function, or the second. The act of classification is what drives algorithmic selection in your code. This type of "classification by algorithm" is almost universal in well designed libraries - you can find it in C++ template libraries like Boost and the STL, and you can find it in legacy Fortran codes like the BLAS.

talonmies
  • Okay, thanks again. I think a main problem previously was that I never knew what to call the different function types, because 'function' can mean either a C++ function or a mathematical function (like f(x)=x^2) :) Okay, now regarding your question: I have an arbitrary vector valued function with vector arguments. To deal with the vector valued function I split it into several scalar valued functions, which still have vector arguments (vectors of about 1000-5000 rows). – tim May 21 '11 at 09:12
  • And yes, I'm now struggling with how to implement the evaluation of the scalar valued functions while keeping up good performance. Each thread should calculate the derivative of a single scalar valued function (there are about 1000-5000 of them), but the algorithm for that is clear so far (http://stackoverflow.com/questions/627055/compute-a-derivative-using-discrete-methods/632538#632538). – tim May 21 '11 at 09:13
  • I still don't understand. How is any of the code you have shown implementing vector valued functions? Every one appears to take either a scalar or vector as an input and produces a scalar output. The fact you are calculating this in parallel for many inputs doesn't mean the function is vector valued. A vector valued function maps one vector space to another. And you can't use the sort of derivative calculation you linked to for functions with vector inputs. Derivative computation in vector spaces is rather different. – talonmies May 21 '11 at 10:15
  • It should be a vector valued function. I've shown you a piece of code where I implement a mathematical vector function (very first code snippet) via a scalar valued function (via a `switch-case` statement, second code snippet). Hope this now helps you post some proposals :) – tim May 23 '11 at 10:01