SageMaker smdebug generating AttributeError with PyTorch container

Question

I am training a PyTorch model. The training script runs successfully on GPU instances (for example, on EC2 instances with the pytorch_p36 conda environment activated). Here is the script for reference:

https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train.py

I adapted the script to run under SageMaker, but there I get the following error, which appears to be raised from an smdebug forward hook:

  File "ss_training_entrypoint.py", line 400, in <module>
    trainer.training(epoch)
  File "ss_training_entrypoint.py", line 315, in training
    loss = self.criterion(outputs, target)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 543, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 132, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 185, in _criterion_parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 160, in _worker
    output = module(*(input + target), **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
    hook_result = hook(self, input, result)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 156, in forward_hook
    module_name = module._module_name
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 587, in __getattr__
    type(self).__name__, name))
AttributeError: 'SegmentationLosses' object has no attribute '_module_name'

Does anyone know why this is happening and how to fix it?

Alternatively, is it possible to disable the smdebug hooks without losing SageMaker functionality (i.e. still having the model trained and usable)?
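
For the second option, here is a minimal sketch of what disabling the hooks might look like. It assumes the job is launched through the SageMaker Python SDK's `sagemaker.pytorch.PyTorch` estimator, whose `debugger_hook_config` parameter can be set to `False` to turn Debugger hook initialization off. The helper names and the role placeholder are hypothetical, the remaining estimator arguments (framework version, instance settings) are left to the caller because their names differ between SDK versions, and I have not confirmed that this avoids the error above.

```python
def estimator_kwargs(entry_point, role, **extra):
    """Build keyword arguments for the PyTorch estimator with the Debugger hook off.

    `extra` carries whatever else the installed SDK version requires
    (framework version, instance type and count, and so on).
    """
    kwargs = dict(extra)
    kwargs.update(
        entry_point=entry_point,
        role=role,
        # False asks SageMaker not to initialize the smdebug hook for this job.
        debugger_hook_config=False,
    )
    return kwargs


def build_estimator(**kwargs):
    """Create the estimator; needs the sagemaker SDK and AWS credentials."""
    # Imported lazily so the helper above can be inspected without the SDK.
    from sagemaker.pytorch import PyTorch

    return PyTorch(**kwargs)


# "<execution-role-arn>" is a placeholder, not a real role.
kwargs = estimator_kwargs("ss_training_entrypoint.py", role="<execution-role-arn>")
```

Calling `build_estimator(**kwargs)` with the remaining required arguments and then `fit(...)` would start the training job as usual. As far as I understand, the trained model artifact is still written to S3, and only the Debugger tensor collection is skipped.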

Thanks very much!
