I am training a PyTorch model. The training script runs successfully on GPU instances (for example, on EC2 with the pytorch_p36 conda environment activated). Here is the script for reference:
https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train.py
I adapted the script to run under SageMaker, but there I get the following error, which appears to be raised inside a forward hook registered by smdebug:
  File "ss_training_entrypoint.py", line 400, in <module>
    trainer.training(epoch)
  File "ss_training_entrypoint.py", line 315, in training
    loss = self.criterion(outputs, target)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 543, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 132, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 185, in _criterion_parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 160, in _worker
    output = module(*(input + target), **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
    hook_result = hook(self, input, result)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 156, in forward_hook
    module_name = module._module_name
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 587, in __getattr__
    type(self).__name__, name))
AttributeError: 'SegmentationLosses' object has no attribute '_module_name'
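To make my reading of the traceback concrete, here is a minimal sketch in plain Python. It does not use torch or smdebug. FakeModule is a hypothetical stand-in that only mimics the two behaviours visible in the traceback: forward hooks are run from __call__ after forward, and __getattr__ raises for unknown attributes. The hook name_reading_hook is also hypothetical, and my assumption is that smdebug's forward_hook fails in the same way because nothing ever set _module_name on my loss module.

```python
# Stand-in for torch.nn.Module (NOT the real class): it only mimics the
# hook dispatch in __call__ and the AttributeError raised by __getattr__.
class FakeModule:
    def __init__(self):
        self._forward_hooks = []

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup has failed.
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def forward(self, *inputs):
        raise NotImplementedError

    def __call__(self, *inputs):
        result = self.forward(*inputs)
        # Hooks run after forward, as at module.py line 545 in the traceback.
        for hook in self._forward_hooks:
            hook(self, inputs, result)
        return result


class SegmentationLosses(FakeModule):
    # Toy loss used only for this illustration.
    def forward(self, outputs, target):
        return abs(outputs - target)


seen_names = []


def name_reading_hook(module, inputs, output):
    # Hypothetical hook: like the failing line in the traceback, it reads
    # an attribute that something else was expected to set beforehand.
    seen_names.append(module._module_name)


criterion = SegmentationLosses()
criterion.register_forward_hook(name_reading_hook)

# The attribute was never set, so the hook fails inside __call__.
try:
    criterion(1.0, 2.0)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# Once the attribute exists on the module, the same hook succeeds.
criterion._module_name = "criterion"
loss = criterion(1.0, 2.0)
print(loss, seen_names)
```

In this stand-in, setting _module_name by hand makes the hook pass. I have not verified whether doing the same on the real criterion (or on the replicas created by encoding/parallel.py) is safe with smdebug.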
Does anyone know why this is happening and how to fix it?
Alternatively, is it possible to disable the smdebug hooks without losing the rest of the SageMaker functionality (i.e. still having the model trained and usable)?
Thanks very much!