
I have a Scheduler class which holds a list of Client objects, each with its own PyTorch model, parameters and training functions. I am trying to train multiple clients in parallel, as I have multiple GPUs and each Client is assigned its own GPU.

The basic code structure is like this:

import torch.multiprocessing as mp

class Scheduler:
    def __init__(self, num_clients):
        self.clients = [] # Client1, ..., ClientN

    def client_update(self, client):
        print("Client {}".format(client.id))
        client.train()
        client.evaluate(self.dataset.test_dataloader)

    def train(self, num_rounds):
        for round in range(num_rounds):
            processes = []

            for client in self.clients:
                process = mp.Process(target=self.client_update, args=(client, ))
                process.start()
                processes.append(process)

            for process in processes:
                process.join()

The Scheduler class is initialised in the main script and the train function is called there. Inside the if __name__ == '__main__': guard I call mp.set_start_method('spawn', force=True).
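For context, a minimal sketch of that guard (using the standard-library multiprocessing here so it runs anywhere; torch.multiprocessing exposes the same API, and the worker is a stand-in for a client update):

```python
import multiprocessing as mp

def worker(i):
    # Stand-in for a client update; real code would train on cuda:i.
    return i * i

if __name__ == '__main__':
    # 'spawn' starts fresh interpreter processes; force=True overrides
    # any start method set earlier in the program.
    mp.set_start_method('spawn', force=True)
    with mp.Pool(2) as pool:
        print(pool.map(worker, range(4)))  # [0, 1, 4, 9]
```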

This method doesn't seem to work: the spawned Process has to recreate the Client object in the child, and I run into an EOFError: Ran out of input, similar to this. Unfortunately I cannot use the same solution as in that thread.
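My understanding is that under the spawn start method, the Process target and its args are pickled and sent to the child, so any unpicklable attribute on a Client (a live CUDA handle, a lock, an open dataloader) makes that transfer fail, and the child can then hit a truncated pipe. A small illustration, using a threading.Lock as a stand-in for unpicklable state:

```python
import pickle
import threading

class Client:
    def __init__(self, client_id):
        self.id = client_id
        # Stand-in for unpicklable state (CUDA handles, locks, open files...)
        self.lock = threading.Lock()

try:
    pickle.dumps(Client(0))
except TypeError as exc:
    # Raises: cannot pickle '_thread.lock' object
    print("pickling failed:", exc)
```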

I also tried a Pool-based approach, but unfortunately couldn't get that working either:

    ctx = mp.get_context('forkserver')
    pool = ctx.Pool(2)
    pool.map(self.client_update, self.clients)
    pool.close()
    pool.join()

I am unsure what the best approach would be to use the GPUs efficiently and speed up training across the clients.
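One pattern that sidesteps the pickling problem (a sketch under my assumptions, not a definitive fix): keep the Process/Pool target at module level rather than a bound method, pass only plain picklable data (client index, device string), and rebuild heavyweight state inside the worker. Shown with the standard-library multiprocessing; torch.multiprocessing has the same interface, and run_client is a hypothetical worker name:

```python
import multiprocessing as mp

def run_client(args):
    # Module-level function: picklable under 'spawn'. Only plain data
    # crosses the process boundary; a real worker would construct the
    # model here and move it to `device` before training.
    client_id, device = args
    return f"client {client_id} finished on {device}"

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    jobs = [(i, f"cuda:{i}") for i in range(2)]
    with ctx.Pool(processes=2) as pool:
        results = pool.map(run_client, jobs)
    print(results)
```

Anything the parent needs back (e.g. updated weights) could be returned as CPU state_dicts, which pickle cleanly.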
