
I am translating a C++11 program, which calculates contact forces between particle pairs, into a CUDA program. All the particle pairs are independent of each other. I use a functor to calculate the contact force. This functor does many computations and contains a lot of member variables, so I am trying to reuse the functors instead of making a new functor per particle pair.

Because the functor contains virtual functions, the functor cloning is done on the device instead of on the host: an object with virtual functions that is constructed on the host carries a host vtable pointer, so its virtual calls would be invalid in device code.

I am thinking of a scheme which goes like this:

1) Clone M functors

2) Start computing M particle pairs

3) Particle pair M+1 waits until one particle pair has completed and then reuses its functor

However, other ideas are also very welcome.
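For concreteness, here is an untested sketch of roughly what I imagine for steps 2 and 3 (the kernel below is my guess, not working code; it uses the FtorBase type from the play program further down). Launching only M threads and letting each thread loop over every M-th particle pair would make the waiting in step 3 implicit, because a clone is only reused after its previous pair has finished:

__global__ void cudaExecuteFtorReuse(FtorBase** ftorBase, long n_clones, long n_pairs)
{
  const long tid = threadIdx.x + blockIdx.x * blockDim.x;

  if (tid >= n_clones) {
    return;
  }

  // Thread tid exclusively owns clone ftorBase[tid] and reuses it for
  // pairs tid, tid + n_clones, tid + 2*n_clones, ...
  for (long i = tid; i < n_pairs; i += n_clones) {
    ftorBase[tid]->execute(i);
    ftorBase[tid]->show();
  }
}

// launched with only M threads in total, e.g.
// cudaExecuteFtorReuse<<<(M + TPB - 1) / TPB, TPB>>>(ftorBase, M, N);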

I've made a very simplified version of the program. In this play program, the F variable does not have to be a member variable, but in the real program it does. The real program also has a lot more member data and far more particle pairs (N); N is often a few million.

#include <stdio.h>

#define TPB 4 // realistic value = 128
#define N 10  // realistic value = 5000000
#define M 5   // trade-off between copy time and parallel gain.
              // Realistic value somewhere around 1000, maybe

#define OPTION 1
// option 1: Make one functor per particle pair => works, but creates too many functor clones
// option 2: Only make one functor clone => no more thread independent member variables
// option 3: Make M clones which get reused => my suggestion, but I don't know how to program it

struct FtorBase
{
  __device__ virtual void execute(long i) = 0;

  __device__ virtual void show() = 0;
};

struct FtorA : public FtorBase
{

  __device__ void execute(long i) final
  {
    F = a*i;
  }

  __device__ void show() final
  {
    printf("F = %f\n", F);
  }

  double a;
  double F;
};

template <class T>
__global__ void cloneFtor(FtorBase** d_ftorBase, T ftor, long n_ftorClones)
{
  const long i = threadIdx.x + blockIdx.x * blockDim.x;

  if (i >= n_ftorClones) {
    return;
  }

  d_ftorBase[i] = new T(ftor);
}

struct ClassA
{
  typedef FtorA ftor_t;

  FtorBase** getFtor()
  {
    FtorBase** d_cmFtorBase;
    cudaMalloc(&d_cmFtorBase, N * sizeof(FtorBase*));
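    // NOTE: device-side new in cloneFtor allocates from the device heap,
    // which defaults to 8 MB; at realistic N this limit would have to be
    // raised first with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).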

#if OPTION == 1 
    // option 1: Create one copy of the functor per particle pair
    printf("using option 1\n");
    cloneFtor<<<(N + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, N);
#elif OPTION == 2
    // option 2: Create just one copy of the functor
    printf("using option 2\n");
    cloneFtor<<<1, 1>>>(d_cmFtorBase, ftor_, 1);
#elif OPTION == 3
    // option 3: Create M functor clones
    printf("using option 3\n");
    printf("This option is not implemented. I don't know how to do this.\n");
    cloneFtor<<<(M + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, M);
#endif
    cudaDeviceSynchronize();

    return d_cmFtorBase;
  }

  ftor_t ftor_;
};


__global__ void cudaExecuteFtor(FtorBase** ftorBase)
{
  const long i = threadIdx.x + blockIdx.x * blockDim.x;

  if (i >= N) {
    return;
  }

#if OPTION == 1
  // option 1: One functor per particle was created
  ftorBase[i]->execute(i);
  ftorBase[i]->show();
#elif OPTION == 2
  // option 2: Only one single functor was created
  ftorBase[0]->execute(i);
  ftorBase[0]->show();
#elif OPTION == 3
  // option 3: Reuse the functors
  // I don't know how to do this
#endif
}

int main()
{
  ClassA* classA = new ClassA();
  classA->ftor_.a = .1;

  FtorBase** ftorBase = classA->getFtor();

  cudaExecuteFtor<<<(N + TPB - 1) / TPB, TPB>>>(ftorBase);
  cudaDeviceSynchronize();

  return 0;
}

I am checking the output of F to see whether the member variable is independent in each call. As expected, when using a different functor for each particle pair (option 1), all the F values are different, and when using only one functor for the whole program (option 2), all the F values are the same.

using option 1
F = 0.800000
F = 0.900000
F = 0.000000
F = 0.100000
F = 0.200000
F = 0.300000
F = 0.400000
F = 0.500000
F = 0.600000
F = 0.700000
using option 2
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000

I wonder if there is a way to get all different F values in this play example without making N copies (option 3).

PS: I am using Ubuntu 18.04, nvcc 9.1 and an NVIDIA GeForce GTX 1060 Mobile graphics card (CUDA compute capability 6.1).

UPDATE:

In the code I originally posted, the problem only showed up in debug mode (compilation with the -G flag), not in the release build. I'm guessing that the compiler optimised printf("F = %f\n", F); into printf("F = %f\n", a*i);, so that the problem of thread-dependent member variables, which this question is about, disappeared.

I updated the code so the compiler can no longer do that substitution in the printf.
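For reference, the original functor computed and printed F inside the same member function, roughly like this (a reconstruction, not the exact original code):

__device__ void execute(long i) final
{
  F = a*i;
  printf("F = %f\n", F); // the release build can substitute a*i for F here
}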

  • I'm confused. When I run your code, either with option 1 or with option 2, I don't get uniform output as you have shown with `F = 25.500000`. Is that output actually generated from **this** code that you have posted? [here](https://pastebin.com/GWgJNzeC) is my test case on CUDA 10.1 – Robert Crovella May 16 '19 at 06:15
  • I see that in your test case, you have defined N to be 10. In this case you get all results for `F = a * {0, 1, ..., N-1}`. `a = 0.1`, thus `F = {0.0, 0.1, ..., 0.9}`. I defined N to be 1000, so in my test case `F = {0.0, 0.1, ..., 99.9}`. With option 1, I get all the values in the collection. When using option 2, the program prints one of the possibilities at random (`F = 25.5` in my case). I'm only showing 10 of the 1000 values that are printed. I will edit my question to clarify this. – thomasbangels May 16 '19 at 06:34
  • You're not doing any CUDA error checking (that I can see) and I don't know if you are running on Windows or Linux. I suggest [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) any time you are having difficulty with a CUDA program, and also running your program with `cuda-memcheck`. I'd encourage you to do that before going much farther. – Robert Crovella May 16 '19 at 06:51
  • PS: I copy-pasted your code to my system and when I use option 2, I always get `F = 0.700000`. cuda-memcheck reports no errors in either case. I'm using Linux. – thomasbangels May 16 '19 at 06:52
  • So the difference in behavior could be the result of a compile flag maybe. – thomasbangels May 16 '19 at 06:59
  • Yes, when I don't run in debug mode (no -G flag), I get the same result as @Robert Crovella – thomasbangels May 16 '19 at 07:02
  • The difference in behavior between release and debug versions is not because of a bug in the compiler. I think it's because the release version optimizes `printf("F = %f\n", F);` to `printf("F = %f\n", a*i);`. I slightly changed the functor so it uses a different function for the calculation of F and the print of F so it cannot do the optimization anymore. Now the release version also prints all the same F values when using option 2. – thomasbangels May 16 '19 at 08:46
  • I also don't expect option 2 to work. I added it to show exactly why functor clones are needed. However, my question is about how I can limit the number of clones to M instead of N, because N will be about 1e6 in the real program, and if I make one functor clone per particle pair, then all my memory resources and execution time will be spent on the functor cloning. – thomasbangels May 16 '19 at 08:59
  • Yes, I agree that due to the "shared" usage of `F`, that sharing a "single" functor is not workable. I deleted some of my previous comments that indicated otherwise. It seems to me that if you must have a member variable like `F` that will hold different data per thread, then you will need one variable like `F` per thread. – Robert Crovella May 16 '19 at 13:14
  • Yes that is true, but as I only have 2048 threads which can run simultaneously, I wonder if it is possible to reuse the functors instead of creating millions of functors when I have millions of particle pairs. – thomasbangels May 16 '19 at 16:19
  • 2048 probably isn't the right number of threads that can be in flight. Unless you are on the very smallest GPU, the number is larger than that. – Robert Crovella May 16 '19 at 16:22
  • You are right. I believe the maximum number of concurrently running threads on one multiprocessor is 2048, but because I have 10 SMs, the maximum number of concurrently running threads on the whole GPU is 20480, I think. – thomasbangels May 16 '19 at 16:41
  • Yes, that sounds right. It [is possible](https://stackoverflow.com/questions/37362554/minimize-the-number-of-curand-states-stored-during-a-mc-simulation/37381193#37381193) to manage a resource based on the threads in flight, but there is a lot of complexity associated with that. I suspect there is a simpler solution to your desire, but for me anyway it would require more thought to figure it out. At least perhaps we can agree that if you really require a temporary variable like `F` per thread, there doesn't seem to be a way to get around at least having some quantity greater than 1 of those. – Robert Crovella May 16 '19 at 17:00
