I have an MLflow server running on an EC2 instance, listening on port 5000.
This EC2 instance has a security group that opens TCP port 5000 to another security group designated for SageMaker.
Both security groups are in the same VPC.
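For reference, a rule like the one described above could be expressed with boto3 roughly as follows. This is only a sketch under assumptions: it treats the rule as an inbound rule on the EC2 instance's group whose source is the SageMaker group, `build_ingress_rule` is a hypothetical helper, and `MLFLOW_SG_ID` is a placeholder because the EC2 group's ID is not given in this post.

```python
def build_ingress_rule(source_group_id, port=5000):
    """Build one IpPermissions entry allowing TCP on `port` from another security group.

    Hypothetical helper for illustration; it only builds the request payload.
    """
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Source is a security group rather than a CIDR range.
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


# The SageMaker group ID below is the one used in the training job.
rule = build_ingress_rule("sg-04868acca16e81183")

# Applying the rule needs AWS credentials, so it is left commented out.
# MLFLOW_SG_ID is a placeholder for the EC2 instance's security group ID.
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(GroupId=MLFLOW_SG_ID, IpPermissions=[rule])
```

Building the payload separately from the API call makes it easy to compare against what the console shows for the existing rule.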
Now I am trying to run a SageMaker training job with the designated security group, so that the training script logs metrics to the EC2 server via its internal IP address (as answered here), but the connection fails.
SageMaker job init:
role = "ml_sagemaker"
security_group_ids = ['sg-04868acca16e81183']
bucket = sagemaker_session.default_bucket()
out_path = f"s3://{bucket}/{project_name}"
estimator = PyTorch(
    entry_point='run_train.py',
    source_dir='.',
    sagemaker_session=sagemaker_session,
    instance_type=instance_type,
    instance_count=1,
    framework_version='1.5.0',
    py_version='py3',
    role=role,
    security_group_ids=security_group_ids,
    hyperparameters={},
)
....
Inside run_train.py:
import mlflow
tracking_uri = "http://172.31.77.137:5000" # <- this is internal ec2 IP
mlflow.set_tracking_uri(tracking_uri)
mlflow.log_param("test_param", 3)
Error:
  File "/opt/conda/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
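Since the traceback ends in a socket-level timeout, a plain TCP check can help separate a network problem from an MLflow problem. The snippet below is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper name, and the 5 second timeout is an arbitrary choice.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


# Example usage at the top of run_train.py, before any mlflow call:
# print("MLflow port reachable:", can_connect("172.31.77.137", 5000))
```

Running the same check from both the Notebook instance and the training job shows whether the two environments really have the same network path to the EC2 instance.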
However, when I create a SageMaker Notebook instance with the same security group and the same IAM role, I am able to connect to the EC2 instance and log metrics from within the Notebook.
Here is the SageMaker Notebook configuration:

How can I connect to the EC2 instance from a SageMaker training job?