Sagemaker Train Job can't connect to ec2 instance

Question

I have MLFlow server running on ec2 instance, port 5000.

This ec2 instance has security group with opened TCP connection on port 5000 to another security group designated for SageMaker.

ec2 instance inbound rules:

SageMaker outbound rules:

These 2 security groups are in the same VPC

Now I am trying to run a SageMaker training job with the designated security group, so that the training script logs metrics to the ec2 server via its internal IP address (as answered here), but the connection fails.

SageMaker job init:

from sagemaker.pytorch import PyTorch

role = "ml_sagemaker"
security_group_ids = ['sg-04868acca16e81183']
bucket = sagemaker_session.default_bucket()
out_path = f"s3://{bucket}/{project_name}"

estimator = PyTorch(entry_point='run_train.py',
                    source_dir='.',
                    sagemaker_session=sagemaker_session,
                    instance_type=instance_type,
                    instance_count=1,
                    framework_version='1.5.0',
                    py_version='py3',
                    role=role,
                    security_group_ids=security_group_ids,
                    hyperparameters={},
                    )
....

Inside run_train.py:

import mlflow
tracking_uri = "http://172.31.77.137:5000"  # <- this is internal ec2 IP
mlflow.set_tracking_uri(tracking_uri)
mlflow.log_param("test_param", 3)

Error:

File "/opt/conda/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

However, when I create a SageMaker Notebook instance with the same security group and the same IAM role, I am able to connect to the ec2 instance and log metrics from within the Notebook.

Here are the SageMaker Notebook configurations:

How can I connect to the ec2 instance from a SageMaker Training Job?

score 4 · Answer 1 · edited May 20 '21 at 11:00

Your estimator creates a standalone instance, so it does not matter that you can reach MLflow from the notebook. If you want to use a subnet/security group configuration with the PyTorch estimator and still have an internet connection, you need to set up the VPC resources for it.
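As a rough sketch of what setting the VPC resources on the estimator could look like: the SageMaker Python SDK estimators accept a `subnets` argument next to `security_group_ids`, and to my understanding the VPC configuration is only applied when both are given. The helper names `build_vpc_kwargs` and `make_estimator` and the subnet ID below are hypothetical placeholders, not values from the question.

```python
def build_vpc_kwargs(subnets, security_group_ids):
    """Return the VPC-related keyword arguments for a SageMaker estimator.

    Both lists must be non-empty; passing only one of them is treated
    as a mistake here (assumption: the SDK needs both to build a VpcConfig).
    """
    if not subnets:
        raise ValueError("at least one subnet ID is required")
    if not security_group_ids:
        raise ValueError("at least one security group ID is required")
    return {
        "subnets": list(subnets),
        "security_group_ids": list(security_group_ids),
    }


def make_estimator(sagemaker_session, instance_type, vpc_kwargs):
    """Build the PyTorch estimator from the question, placed inside the VPC."""
    # Imported lazily so build_vpc_kwargs works without the SDK installed.
    from sagemaker.pytorch import PyTorch

    return PyTorch(
        entry_point="run_train.py",
        source_dir=".",
        sagemaker_session=sagemaker_session,
        instance_type=instance_type,
        instance_count=1,
        framework_version="1.5.0",
        py_version="py3",
        role="ml_sagemaker",
        **vpc_kwargs,
    )


# Hypothetical usage; the subnet ID is a placeholder for a subnet in the
# same VPC as the MLflow instance:
# vpc_kwargs = build_vpc_kwargs(
#     subnets=["subnet-0123456789abcdef0"],
#     security_group_ids=["sg-04868acca16e81183"],
# )
# estimator = make_estimator(sagemaker_session, instance_type, vpc_kwargs)
```

The validation is kept separate from the estimator construction so that a missing subnet list fails loudly instead of the job quietly starting outside the VPC.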

I had the same issue: SageMaker plus an MLflow server on another ec2 instance. The first instinct is to assign the estimator the same VPC and security groups as the ec2 instance (the MLflow server). They should then be able to connect to each other, since they are within the same private network. This leads to another problem: the instance that SageMaker spins up cannot connect to the internet to download the libraries/packages you specify in requirements.txt (e.g. mlflow). The question then becomes how to give it internet access.

The only way to provide internet access when subnets are used for estimators is to run the job in a subnet that has a NAT gateway configured.

1. Create a NAT gateway in one of your public subnets, such as subnet-axxxx.
2. Create a new route table named “NAT_Route_Table”.
3. Edit its routes: add Destination 0.0.0.0/0 with the newly created NAT gateway as Target (add other routes if needed).
4. Create a new subnet named “NAT_Subnet” and associate it with the newly created “NAT_Route_Table”.
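The steps above can also be scripted. The following is a minimal sketch using boto3 EC2 client calls, not a hardened setup script. The function name `create_nat_subnet` and its parameters are hypothetical, it assumes the NAT gateway needs a freshly allocated Elastic IP, and it skips naming the resources (the names in the steps would be applied as tags).

```python
def create_nat_subnet(ec2, vpc_id, public_subnet_id, private_cidr):
    """Create a NAT gateway and a new subnet whose default route uses it.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the IDs of the created resources.
    """
    # A public NAT gateway needs an Elastic IP (assumption: none exists yet).
    allocation_id = ec2.allocate_address(Domain="vpc")["AllocationId"]

    # Step 1: NAT gateway in an existing public subnet.
    nat_id = ec2.create_nat_gateway(
        SubnetId=public_subnet_id,
        AllocationId=allocation_id,
    )["NatGateway"]["NatGatewayId"]
    # Wait until the gateway is usable before routing traffic to it.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # Step 2: new route table in the same VPC.
    route_table_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]

    # Step 3: default route 0.0.0.0/0 targeting the NAT gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )

    # Step 4: new subnet, associated with the new route table.
    subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock=private_cidr)["Subnet"]["SubnetId"]
    ec2.associate_route_table(RouteTableId=route_table_id, SubnetId=subnet_id)

    return {
        "nat_gateway_id": nat_id,
        "route_table_id": route_table_id,
        "subnet_id": subnet_id,
    }


# Hypothetical usage (IDs and CIDR are placeholders):
# import boto3
# ids = create_nat_subnet(boto3.client("ec2"), "vpc-xxxx", "subnet-axxxx", "172.31.96.0/20")
```

The client is passed in rather than created inside the function, which keeps region and credentials handling outside the sketch and makes the call sequence easy to dry-run against a stub.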

Launch the estimator in “NAT_Subnet”; its traffic will then go through the NAT gateway to the internet.
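To confirm from inside the training job that both paths work (the private route to MLflow and the NAT route to the internet), a plain TCP probe at the top of run_train.py may be enough. This is a small sketch using only the standard library; `can_connect` is a hypothetical helper, and pypi.org is just an example of an external host.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False


# Hypothetical usage inside run_train.py, before calling mlflow:
# print("MLflow reachable:", can_connect("172.31.77.137", 5000))
# print("Internet reachable:", can_connect("pypi.org", 443))
```

A probe that times out on the MLflow address points at the VPC/security group setup, while a failure only on the external host points at the NAT route.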
