
I am testing a way to speed up "high-level" computations using multiprocessing. The idea is to have multiple processes (call them G) each running a different execution of the same task T. T can be a long task (playing a whole board game) that returns some results at the end. It can run asynchronously, and I already know how to gather all the results using multiprocessing and apply_async.
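For reference, here is the kind of gathering I mean: a minimal sketch where the long task T is stubbed out by a trivial play_game function (the pool size and names are mine):

import multiprocessing as mp

def play_game(seed):
    # stand-in for the long task T (e.g. playing a whole board game)
    return seed * seed

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        handles = [pool.apply_async(play_game, (s,)) for s in range(8)]
        results = [h.get() for h in handles]  # gather all results at the end
    print(results)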

However, at some point T needs to call GPU functions. My idea was to create another process S that acts as a service: it gathers data from registered Gs, calls a GPU function (a TensorFlow NN evaluation) on the batched data while the Gs are halted, and "sends" the results back to the corresponding Gs.

I looked at the answers to How do I make processes able to write in an array of the main program?. The difference here, however, is that the gathering does not happen only at the end of the tasks.

Do you think this is possible? I have also tried different approaches using ctypes and OpenMP, without success.

Here is pseudocode of what I would like to do:

shared_service = Service()

class Worker():
    def __init__(self):
        shared_service.register(self)
        ...

    def run(self):
        finished = False
        while not finished:
            ... do my stuff ...
            ... gather data to "send" to GPU ...
            shared_service.request(self, data)
            ... wait for result ...
            ... use result ...
            ... do more stuff ...

    def callback(self, result):
        ... store result ...

class Service():
    ...
    def register(self, o):
        ... register new "client" ...

    def request(self, o, data):
        ... add request to current buffer (and keep track of requester) ...

    def run(self):
        while True:
            ... wait for full buffer ...
            ... call GPU function ...
            ... dispatch results to "clients" ...

main:
  ... init one "Service" ...
  ... init N "Worker" ...

  ... run N Workers asynchronously ...
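
To make the pattern concrete, here is a minimal runnable sketch of the above using a multiprocessing.Queue for requests and one Pipe per worker for the replies. The GPU call is stubbed out with a trivial placeholder, and BATCH_SIZE and the function names are my own invention:

import multiprocessing as mp

BATCH_SIZE = 4  # the service waits for this many requests before one GPU call

def service(request_queue, reply_conns):
    # Collect a full buffer, run one batched "GPU" call, dispatch results.
    while True:
        batch = [request_queue.get() for _ in range(BATCH_SIZE)]
        ids, data = zip(*batch)
        results = [x * 2 for x in data]           # placeholder for the batched TF/GPU call
        for worker_id, result in zip(ids, results):
            reply_conns[worker_id].send(result)   # unblock the waiting worker

def worker(worker_id, request_queue, reply_conn):
    for step in range(3):
        data = worker_id * 10 + step          # ... do my stuff, gather data ...
        request_queue.put((worker_id, data))  # request a GPU evaluation
        result = reply_conn.recv()            # ... wait for result ...
        print(f"worker {worker_id}: {data} -> {result}")

if __name__ == "__main__":
    n_workers = BATCH_SIZE  # the buffer only fills if every worker submits each round
    request_queue = mp.Queue()
    pipes = [mp.Pipe() for _ in range(n_workers)]
    service_proc = mp.Process(target=service,
                              args=(request_queue, [p[0] for p in pipes]),
                              daemon=True)
    service_proc.start()
    workers = [mp.Process(target=worker, args=(i, request_queue, pipes[i][1]))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()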

Thank you for your help!

BilboX

1 Answer


The way the underlying architecture of GPUs works is that all the cores execute the same instruction, but on different parts of the data array. Another significant architectural feature is that CPU memory is copied to GPU memory, used by the GPU for the full execution of the kernel (a series of instructions), and the result is then copied back to the CPU to do whatever with.

So you are likely:

  1. only executing a single unique task against the GPU at a time, no matter how you code the higher layer, and waiting the full duration of its execution (even if some of the parallel code finishes early)
  2. better off calling the TensorFlow API (or whatever API you choose) and letting it handle all of the GPU management (see the sketch after this list)
  3. better off using the builtin async (found here) and not worrying about the GPUs, instead reaping the benefits of multiprocessor environments anyway (depending on what the task actually is)
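
On point 2, a minimal sketch of what letting TensorFlow handle the GPU can look like, assuming a TF2/Keras setup; the model, shapes, and N=4 workers are made up for illustration:

import numpy as np
import tensorflow as tf

# hypothetical model standing in for the NN to evaluate
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])

# N = 4 workers, each contributing a batch of 64 inputs
per_worker = [np.random.rand(64, 8).astype("float32") for _ in range(4)]
stacked = np.concatenate(per_worker)     # one batch of 64*N
out = model.predict(stacked, verbose=0)  # single call; TensorFlow manages the GPU
chunks = np.split(out, 4)                # slice the results back out per worker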
carrvo
  • Hi. Indeed, I know that the GPU is the bottleneck and that it will have to wait for all the CPUs' data to be computed. Nonetheless, parallelizing that CPU work is still a way to improve execution time. Furthermore, without parallel work the same thing happens: the GPU is the bottleneck. However, calling the TF GPU operation on a batch of 64*N (N being the number of parallel CPU workers) does not take much longer than on a single batch of 64. – BilboX Jan 21 '21 at 20:09