I have a `Scheduler` class which contains a list of `Client` objects, each with its own PyTorch model, parameters and training functions. I am trying to train multiple clients in parallel, as I have multiple GPUs and each `Client` is assigned a GPU.
The basic code structure is like this:
```python
import torch.multiprocessing as mp

class Scheduler:
    def __init__(self, num_clients):
        self.clients = []  # Client1, ..., ClientN

    def client_update(self, client):
        print("Client {}".format(client.id))
        client.train()
        client.evaluate(self.dataset.test_dataloader)

    def train(self, num_rounds):
        for round in range(num_rounds):
            processes = []
            for client in self.clients:
                process = mp.Process(target=self.client_update, args=(client,))
                process.start()
                processes.append(process)
            for process in processes:
                process.join()
```
The `Scheduler` class is initialised in the main script and the `train` function is called there. Within the `if __name__ == '__main__':` guard I set `mp.set_start_method('spawn', force=True)`.
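For completeness, a stripped-down, runnable sketch of the whole setup (a toy CPU-only model stands in for the real per-client models; my real `Scheduler` also holds datasets and dataloaders):

```python
import torch
import torch.multiprocessing as mp

class Client:
    def __init__(self, id):
        self.id = id
        self.model = torch.nn.Linear(4, 2)  # stand-in for the real model

    def train(self):
        # one dummy forward/backward step so the child does real work
        x = torch.randn(8, 4)
        self.model(x).sum().backward()

class Scheduler:
    def __init__(self, num_clients):
        self.clients = [Client(i) for i in range(num_clients)]

    def client_update(self, client):
        print("Client {}".format(client.id))
        client.train()

    def train(self, num_rounds):
        for round in range(num_rounds):
            processes = []
            for client in self.clients:
                process = mp.Process(target=self.client_update, args=(client,))
                process.start()
                processes.append(process)
            for process in processes:
                process.join()

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    Scheduler(num_clients=2).train(num_rounds=1)
```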
This method doesn't seem to work: the `Process` creates a new `Client` object and I run into an `EOFError: Ran out of input` error, similar to this. Unfortunately I cannot use the same solution as in that thread.
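My understanding of the cause (happy to be corrected): with the spawn start method, `mp.Process` pickles the target and its arguments, and since the target is the bound method `self.client_update`, the whole `Scheduler`, including its dataset and dataloaders, has to go through pickle. A toy demonstration of how a single unpicklable attribute breaks this (the lambda is a hypothetical stand-in for something like a dataloader's locally defined `collate_fn`):

```python
import pickle

class Scheduler:
    def __init__(self):
        # stand-in for an unpicklable member, e.g. an open file handle
        # or a dataloader built around a local function/lambda
        self.dataset = lambda: None

    def client_update(self, client):
        client.train()

s = Scheduler()
try:
    # roughly what spawn does with target=self.client_update:
    # pickling the bound method pickles the whole instance
    pickle.dumps(s.client_update)
except Exception as e:
    print("pickling failed:", type(e).__name__)
```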
I also tried using a `Pool`, but couldn't get that working either:

```python
import functools

ctx = mp.get_context('forkserver')
pool = ctx.Pool(2)
pool.map(functools.partial(self.client_update), self.clients)
pool.close()
pool.join()
```
I am unsure what the best approach would be to use the GPUs efficiently and speed up training across the clients.
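One direction I have been considering, though I am not sure it is idiomatic: make the worker a module-level function that receives only the `Client` and a device id, so the `Scheduler` (and its dataloaders) never has to be pickled. A rough sketch with toy stand-ins for my real classes and a CPU fallback when no GPU is present:

```python
import torch
import torch.multiprocessing as mp

class Client:
    def __init__(self, id):
        self.id = id
        self.model = torch.nn.Linear(4, 2)  # stand-in for the real model

    def train(self, device):
        # move the model to its assigned device and do one dummy step
        x = torch.randn(8, 4, device=device)
        self.model.to(device)(x).sum().backward()

def client_update(client, device_id):
    # module-level target: only `client` and `device_id` are pickled,
    # not the Scheduler that launched the process
    if torch.cuda.is_available():
        device = torch.device("cuda", device_id)
    else:
        device = torch.device("cpu")
    client.train(device)

class Scheduler:
    def __init__(self, num_clients):
        self.clients = [Client(i) for i in range(num_clients)]

    def train(self, num_rounds):
        for round in range(num_rounds):
            processes = []
            for gpu_id, client in enumerate(self.clients):
                p = mp.Process(target=client_update, args=(client, gpu_id))
                p.start()
                processes.append(p)
            for p in processes:
                p.join()
```

Evaluation would then need the test dataloader passed to the worker explicitly (or rebuilt inside the child), since `self.dataset` is no longer reachable from it. Is this the right approach, or is there a better pattern for this?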