When I run a simple training job on Amazon SageMaker, a ProfilerReport (which I did not configure) is also enabled by default, and a processing job appears in parallel to the training job.
The training job runs successfully, but a few times (so I don't know how to reproduce the error) the profiler report job fails with a generic error:
InternalServerError: An internal error occurred. Try again.
Looking at the CloudWatch logs, the last few are all like this:
Put the output notebook in /opt/ml/processing/output/rule/profiler-output/profiler-report.ipynb
Put the html in /opt/ml/processing/output/rule/profiler-output/profiler-report.html
Current timestamp 1666357140000000 last timestamp 1666357080000000: waiting for new profiler data.
Current timestamp 1666357140000000 most recent timestamp 1666357080000000: waiting for new profiler data.
Current timestamp 1666357140000000 most recent timestamp 1666357080000000: waiting for new profiler data.
......
with this "waiting for new profiler data" message repeating until the end.
The training job in question lasted 2 days, but the profiler report failed after 20 hours. Looking at the instance metrics, resource usage does not appear to be the problem.
The only thing I can think of is that I configured early stopping (progressively saving only the best model), so in the last phase of training it does not save any data.
Could the explanation then be that, since nothing is being saved, the profiler report times out? Shouldn't the ProfilerReport also show a lot of other information about the training job from the debugger, such as GPU utilization and more?
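To make the early-stopping point concrete, the logic in training.py is roughly like the sketch below (the model, data, and file names here are illustrative placeholders, not my actual script):

import os
import torch
import torch.nn as nn

# placeholder model/data just to illustrate the early-stopping pattern
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

model_dir = os.environ.get("SM_MODEL_DIR", ".")  # SageMaker's model directory, "." when run locally
best_val_loss = float("inf")
patience, epochs_without_improvement = 5, 0

for epoch in range(100):
    # training step
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # "validation" step (same data here, only for illustration)
    with torch.no_grad():
        val_loss = loss_fn(model(x), y).item()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        # only the best model is saved, progressively
        torch.save(model.state_dict(), os.path.join(model_dir, "best_model.pt"))
    else:
        epochs_without_improvement += 1
        # nothing is written in this phase, which is why I suspect the
        # profiler stops seeing new data towards the end of training
        if epochs_without_improvement >= patience:
            break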
This is a simplified example of the training job (estimator) code:
from sagemaker.pytorch import PyTorch

tft_train_estimator = PyTorch(
    base_job_name="my-training-job-name",  # note: a comma was missing here in my first paste
    entry_point="training.py",
    framework_version="1.12.0",
    py_version="py38",
    role=role,
    instance_count=1,
    instance_type=train_instance_type,
    code_location=code_location,
    output_path=output_model_path,
)
In every case, the trained model works correctly.
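For completeness: my understanding from the SageMaker Python SDK docs is that the default profiling (and with it the ProfilerReport processing job) could be turned off on the estimator with disable_profiler, something like the sketch below (untested on my side), but I would rather understand what is actually causing the error:

from sagemaker.pytorch import PyTorch

# same estimator as above, but with Debugger monitoring/profiling disabled
# (disable_profiler usage is my assumption based on the SDK documentation)
tft_train_estimator = PyTorch(
    base_job_name="my-training-job-name",
    entry_point="training.py",
    framework_version="1.12.0",
    py_version="py38",
    role=role,
    instance_count=1,
    instance_type=train_instance_type,
    code_location=code_location,
    output_path=output_model_path,
    disable_profiler=True,
)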