
I am thinking about using the strategy detailed in this post to run a Python script on another AWS instance (from an AWS instance) using SSH. However, this Python script can take days to finish, and I am concerned that the SSH connection might break, causing the script to stop on the remote instance (it's not the end of the world if that happens, but it means I'd have to restart the script from scratch, so it can't happen frequently). How probable is it that an SSH connection between two AWS instances breaks over the course of a few days? Are there any simple ways to make such a connection more stable while still maintaining the console forwarding that SSH affords?

(I can't use AWS's SSM because of the max timeout value of 48 hours on an SSM command)

1 Answer


SSH is designed to provide a login to an interactive shell. It is not a good way to architect inter-machine communications.

I would recommend a loosely-coupled approach:

  • Instance-A pushes a work request to an Amazon SQS queue
  • Instance-B has a 'worker' app waiting for work. It regularly polls the SQS queue; when a work request arrives, it processes the request.

You would also need to decide what to do if the worker fails or does not complete the work. Normally, the message would time out and reappear on the queue for another worker to process. However, your scenario might be much simpler, having only one worker.

The benefit of sending the work request via SQS is that work can be queued, waiting for the worker to complete a previous task. Also, multiple workers can be spawned if you wish to process jobs in parallel.
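A minimal sketch of the pattern in Python with boto3. The queue name, message shape, and helper names are illustrative, not prescribed by the answer; the key idea is that the message is deleted only after the handler succeeds, so a crashed worker's message reappears after the queue's visibility timeout:

```python
import json


def send_work_request(sqs, queue_url, payload):
    """Instance-A: push a work request onto the SQS queue."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))


def poll_once(sqs, queue_url, handler):
    """Instance-B: long-poll for one message and process it.

    The message is deleted only after the handler returns without
    raising; if the worker dies mid-job, the message becomes visible
    again after the queue's visibility timeout and can be retried.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling reduces empty responses
    )
    for msg in resp.get("Messages", []):
        handler(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])


def main():
    """Illustrative wiring on real instances (requires AWS credentials).

    The queue name 'work-requests' and the payload fields are assumptions
    for the sketch. Instance-A would call send_work_request(); Instance-B
    would run the polling loop.
    """
    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="work-requests")["QueueUrl"]
    send_work_request(sqs, queue_url, {"script": "train.py", "epochs": 100})
    while True:
        poll_once(sqs, queue_url, handler=print)
```

Because the SQS client is passed in as a parameter, the send/poll logic can be exercised locally with a stub before pointing it at a real queue.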

See: Queuing Chain Pattern - AWS-CloudDesignPattern

John Rotenstein
  • Thanks for the response! I was hoping to use something like SSH because console forwarding is important (I will be periodically checking in on the script running in a tmux shell on instance A, and hope to see the output of Instance B's script in that same shell). Basically I need exactly what SSH does except without the potential for disconnection. And I'm hoping exceptions could be forwarded too, like they would be using subprocess/SSH. I could engineer my own pseudo-log-streaming system using S3, but I am trying to keep things simple. – Charles Cunningham Aug 12 '20 at 03:38
  • For more info on my use-case, I am using instance A as a dev box and instance B as a (more expensive per-hour) GPU box for CNN model training. The other benefit of this is the dev box can run multiple training sessions at once. I want to be able to check in on the model training but want the script running on the dev box to automatically shut down instance B when it finishes (I have already written the code for this using boto3). I am trying to limit complexity because I am going to be transferring this project to other developers and want to limit the learning curve on how the system works. – Charles Cunningham Aug 12 '20 at 03:40
  • Or alternatively SSM would work for my purpose, but unfortunately the maximum executionTimeout is 48 hours and I will most likely be running training sessions which take longer than that. Do you know of any way to exceed a runtime of 48 hours on an SSM command? – Charles Cunningham Aug 12 '20 at 18:18