partial-ized forward method for a torch Model does not work well with multi-gpu jobs

Question

I am trying to understand why re-assigning the forward method of a PyTorch model object leads to the following error in a multi-GPU prediction job (the multi-GPU setup is configured automatically by the Hugging Face Trainer):

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

This happens when I re-assign the forward method of my model object like so:

from functools import partial

model = CustomModel(...)
partial_kwargs = {'key1': value1, ...}  # extra keyword arguments to pre-bind
model.forward = partial(model.forward, **partial_kwargs)
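One possible explanation, sketched here without GPUs or torch: the partial stores a method bound to the original model object, so any copy of the model that inherits this instance attribute still calls forward on the original. This sketch assumes that the Trainer wraps the model in torch.nn.DataParallel when several GPUs are visible and that DataParallel replicates a module by shallow-copying its attributes; I have not verified either in my setup. TinyModel and its device attribute are hypothetical stand-ins, and copy.copy only mimics the replication step.

```python
import copy
from functools import partial


class TinyModel:
    """Hypothetical stand-in for a torch.nn.Module; `device` mimics where its weights live."""

    def __init__(self, device):
        self.device = device

    def forward(self, x, scale=1):
        # Report which instance's "weights" were actually used.
        return f"input from {x} handled by weights on {self.device}"


original = TinyModel("cuda:0")
# Instance-level override: the partial stores a method bound to `original`.
original.forward = partial(original.forward, scale=2)

# Mimic replication with a shallow copy (an assumption about what DataParallel does),
# then pretend the replica was moved to the second GPU.
replica = copy.copy(original)
replica.device = "cuda:1"

# The copied attribute still points at the original instance, not at the replica.
bound_to_original = replica.forward.func.__self__ is original
result = replica.forward("cuda:1")
print(bound_to_original)
print(result)
```

If the real replication behaves like this, a replica on cuda:1 would run the original model's forward with weights on cuda:0, which would match the device mismatch in the error message.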

If instead I pass partial_kwargs as constructor kwargs of CustomModel, I don't get the cuda device error above.
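A minimal sketch of why the constructor route might behave differently, again with a torch-free stand-in: when forward is defined on the class and only reads stored defaults from self, it binds to whichever instance it is called on, including a copy. TinyModelWithDefaults and forward_defaults are hypothetical names, and copy.copy is only an assumed approximation of how replicas are created.

```python
import copy


class TinyModelWithDefaults:
    """Hypothetical stand-in for CustomModel; takes the extra kwargs in its constructor."""

    def __init__(self, device, **forward_defaults):
        self.device = device
        self.forward_defaults = forward_defaults

    def forward(self, x, **kwargs):
        # Call-time kwargs override the stored defaults.
        merged = {**self.forward_defaults, **kwargs}
        return self.device, merged


original = TinyModelWithDefaults("cuda:0", key1="value1")

# Mimic replication with a shallow copy, then pretend the replica moved to the second GPU.
replica = copy.copy(original)
replica.device = "cuda:1"

# `forward` lives on the class, so it binds to the instance it is called on.
used_device, used_kwargs = replica.forward("batch")
print(used_device, used_kwargs)
```

Here the replica uses its own device while still seeing the pre-set keyword arguments, which is consistent with the constructor approach not raising the device error.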

Please let me know if anything in the description is unclear and I can add more context. This question seems related but is not the same, as I did not explicitly assign specific CUDA devices anywhere in the code.

