I am translating a C++11 program that calculates contact forces between particle pairs into a CUDA program. All the particle pairs are independent of each other. I use a functor to calculate the contact force. This functor does many computations and contains a lot of member variables. Therefore I am trying to reuse the functors instead of creating a new functor for every particle pair.
Because the functor contains virtual functions, the functor cloning is done on the device instead of on the host.
I am thinking of a scheme which goes like this:
1) Clone M functors
2) Start computing M particle pairs
3) Particle pair M+1 waits until one particle pair has completed and then reuses its functor
However, other ideas are also very welcome.
I've made a very simplified version of the program. In this play program, the F variable does not have to be a member variable, but in the real program it needs to be. There is also a lot more member data and particle pairs (N) in the real program. N is often a few million.
#include <stdio.h>
#define TPB 4 // realistic value = 128
#define N 10 // realistic value = 5000000
#define M 5 // trade-off between copy time and parallel gain.
// Realistic value somewhere around 1000 maybe
#define OPTION 1
// option 1: Make one functor per particle pair => works, but creates too many functor clones
// option 2: Only make one functor clone => no more thread independent member variables
// option 3: Make M clones which get reused => my suggestion, but I don't know how to program it
struct FtorBase
{
    __device__ virtual void execute(long i) = 0;
    __device__ virtual void show() = 0;
};
struct FtorA : public FtorBase
{
    __device__ void execute(long i) final
    {
        F = a*i;
    }
    __device__ void show() final
    {
        printf("F = %f\n", F);
    }
    double a;
    double F;
};
template <class T>
__global__ void cloneFtor(FtorBase** d_ftorBase, T ftor, long n_ftorClones)
{
    const long i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n_ftorClones) {
        return;
    }
    d_ftorBase[i] = new T(ftor);
}
struct ClassA
{
    typedef FtorA ftor_t;
    FtorBase** getFtor()
    {
        FtorBase** d_cmFtorBase;
        cudaMalloc(&d_cmFtorBase, N * sizeof(FtorBase*));
#if OPTION == 1
        // option 1: Create one copy of the functor per particle pair
        printf("using option 1\n");
        cloneFtor<<<(N + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, N);
#elif OPTION == 2
        // option 2: Create just one copy of the functor
        printf("using option 2\n");
        cloneFtor<<<1, 1>>>(d_cmFtorBase, ftor_, 1);
#elif OPTION == 3
        // option 3: Create M functor clones
        printf("using option 3\n");
        printf("This option is not implemented. I don't know how to do this.\n");
        cloneFtor<<<(M + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, M);
#endif
        cudaDeviceSynchronize();
        return d_cmFtorBase;
    }
    ftor_t ftor_;
};
__global__ void cudaExecuteFtor(FtorBase** ftorBase)
{
    const long i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= N) {
        return;
    }
#if OPTION == 1
    // option 1: One functor per particle pair was created
    ftorBase[i]->execute(i);
    ftorBase[i]->show();
#elif OPTION == 2
    // option 2: Only one single functor was created
    ftorBase[0]->execute(i);
    ftorBase[0]->show();
#elif OPTION == 3
    // option 3: Reuse the functors
    // I don't know how to do this
#endif
}
int main()
{
    ClassA* classA = new ClassA();
    classA->ftor_.a = .1;
    FtorBase** ftorBase = classA->getFtor();
    cudaExecuteFtor<<<(N + TPB - 1) / TPB, TPB>>>(ftorBase);
    cudaDeviceSynchronize();
    return 0;
}
I am checking the output of F to see whether the member variable is independent in each call. As expected, when using a different functor for each particle pair (option 1), all the F values are different and when using only one functor for the whole program (option 2), all the F values are the same.
using option 1
F = 0.800000
F = 0.900000
F = 0.000000
F = 0.100000
F = 0.200000
F = 0.300000
F = 0.400000
F = 0.500000
F = 0.600000
F = 0.700000
using option 2
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
I wonder if there is a way to get all different F values in this play example without taking N copies (option 3).
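For illustration, the reuse scheme from the top of the question (M clones, with pair M+1 waiting for a free functor) could perhaps be sketched with a strided loop: launch only M threads, give each thread exclusive ownership of one functor, and let thread t process pairs t, t+M, t+2M, and so on. The "waiting" then happens implicitly, because a thread only starts its next pair after finishing the previous one. This is only a sketch under assumptions, not a tested solution for the real program. The names `createPool`, `executeWithPool`, `destroyPool` and `runWithPool` are hypothetical, `show()` is replaced by a hypothetical `result()` accessor so the values can be copied back to the host, the functors are constructed on the device from the parameter `a` instead of being copied from a host object, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct FtorBase
{
    __device__ virtual void execute(long i) = 0;
    __device__ virtual double result() const = 0; // hypothetical accessor, stands in for show()
};

struct FtorA : public FtorBase
{
    __device__ void execute(long i) final { F = a * i; }
    __device__ double result() const final { return F; }
    double a;
    double F;
};

// Build m functors on the device heap (assumption: constructed from 'a' on the device).
__global__ void createPool(FtorBase** pool, double a, int m)
{
    const long t = threadIdx.x + (long)blockIdx.x * blockDim.x;
    if (t >= m) return;
    FtorA* f = new FtorA();
    f->a = a;
    pool[t] = f;
}

// Thread t is the only user of pool[t] and walks pairs t, t+m, t+2m, ...
__global__ void executeWithPool(FtorBase** pool, int m, long n, double* out)
{
    const long t = threadIdx.x + (long)blockIdx.x * blockDim.x;
    if (t >= m) return;
    FtorBase* f = pool[t];
    for (long i = t; i < n; i += m) {
        f->execute(i);        // member variables are private to this thread
        out[i] = f->result(); // store F for pair i
    }
}

// Release the functors again; the concrete type is known here, so no virtual destructor is used.
__global__ void destroyPool(FtorBase** pool, int m)
{
    const long t = threadIdx.x + (long)blockIdx.x * blockDim.x;
    if (t >= m) return;
    delete static_cast<FtorA*>(pool[t]);
}

// Host helper: run n pairs with a pool of m functors and return F for every pair.
std::vector<double> runWithPool(long n, int m, double a)
{
    const int tpb = 128;
    const int blocks = (m + tpb - 1) / tpb;
    FtorBase** d_pool = nullptr;
    double* d_out = nullptr;
    cudaMalloc(&d_pool, m * sizeof(FtorBase*));
    cudaMalloc(&d_out, n * sizeof(double));

    createPool<<<blocks, tpb>>>(d_pool, a, m);
    executeWithPool<<<blocks, tpb>>>(d_pool, m, n, d_out);

    std::vector<double> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    destroyPool<<<blocks, tpb>>>(d_pool, m);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    cudaFree(d_pool);
    return out;
}
```

With this sketch, `runWithPool(10, 5, 0.1)` should return a different F for each of the 10 pairs (F = 0.1 * i) although only 5 functors exist. Since only M threads are ever active, M would probably have to be considerably larger than 1000 in the real program to keep the GPU busy, so the value of M becomes a trade-off between memory for the clones and occupancy.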
PS: I am using Ubuntu 18.04, nvcc 9.1 and an NVIDIA GeForce GTX 1060 Mobile graphics card (CUDA compute capability 6.1).
UPDATE:
In the code I presented previously, the problem only appeared in debug mode (compilation with the -G flag), not in the release build. My guess is that the compiler optimised printf("F = %f\n", F); to printf("F = %f\n", a*i); so that the problem of thread-dependent member variables, which is what this question is about, disappeared.
I updated the code, so the compiler cannot do the substitution in the printf anymore.