
This is a very tricky one.

I'm very new to Amazon SageMaker and I can't seem to find an answer to this problem. I don't know if what I want to do is even possible.

Basically, suppose I have a notebook instance on Amazon SageMaker. I want to connect this notebook instance automatically to:

  • A specific S3 bucket (or even a specific sub-directory inside an S3 bucket)
  • A specific sub-directory in a remote Git Repository (hosted in GitHub/BitBucket/other platforms)

And this has to be done automatically every time the notebook instance is started. Would something like this be possible?

I've tried looking at lifecycle configurations, but as I'm not fully aware of their capabilities, I don't know if it's even possible to do this with the lifecycle config bash script.

I'm very open to other ideas if anyone knows how to do something similar, even if it means that I have to tinker with the AWS CLI, the SageMaker SDK/API, the GitHub and BitBucket APIs, or other AWS services like Lambda.

Thanks heaps in advance!

blue2609

1 Answer


I'm not entirely sure what it means to "connect" a specific S3 bucket to a notebook instance, but I assume you would like to download its content to the underlying EBS volume of your instance. For Git, my assumption is that you'd like to clone a specific subfolder from a repository.

To automate all of this, you can use Lifecycle Configuration (LCC) scripts, as you mentioned. For S3, you can call the AWS CLI from the LCC script to download specific objects or entire buckets/prefixes (for multiple files, use the aws s3 sync command). The only caveat is that the execution role you have set for your notebook instance must have read access to those S3 objects. This role, and not the policies set for your IAM user, determines what you can access from your notebook instance.
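As a rough sketch of the S3 part, an on-start lifecycle script could look something like the following. The function name sync_s3_prefix, the bucket name, the prefix, and the target directory are all hypothetical placeholders; /home/ec2-user/SageMaker is used as the target because that is the directory kept on the notebook's EBS volume.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: copy everything under s3://<bucket>/<prefix>
# into a local directory using the AWS CLI.
sync_s3_prefix() {
  local bucket="$1"   # bucket name (placeholder)
  local prefix="$2"   # "sub-directory" inside the bucket (placeholder)
  local dest="$3"     # local target directory

  mkdir -p "$dest"
  # The notebook's execution role needs read access to these objects.
  aws s3 sync "s3://${bucket}/${prefix}" "$dest"
}

# Example call with made-up names; uncomment and adjust for your setup:
# sync_s3_prefix "my-example-bucket" "data/train" "/home/ec2-user/SageMaker/data"
```

Lifecycle scripts run as root, so you may want to follow the sync with chown -R ec2-user:ec2-user on the target directory so the files are writable from Jupyter. On-start scripts are also limited to a few minutes of runtime, so a large sync may need to be started in the background.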

To clone a Git repository, you can simply call the git command from the LCC script. For a long time it was not possible to clone only a subfolder of a repository, but there is now a solution for this. Please see the following post: How do I clone a subdirectory only of a Git repository?
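A minimal sketch of the Git part, assuming a Git version with the sparse-checkout command (2.25 or newer, which may not be what your notebook instance ships with). The function name clone_subdir, the repository URL, and the folder names are hypothetical. Because the EBS volume survives restarts, the sketch updates an existing clone instead of cloning again.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: check out only one sub-directory of a repository.
clone_subdir() {
  local repo_url="$1"  # repository URL (placeholder)
  local subdir="$2"    # sub-directory to check out
  local dest="$3"      # local target directory

  if [ -d "$dest/.git" ]; then
    # A clone from an earlier start is already on the volume: just update it.
    git -C "$dest" pull --ff-only
  else
    # Partial clone with a sparse working tree, then narrow it to the sub-directory.
    git clone --filter=blob:none --sparse "$repo_url" "$dest"
    git -C "$dest" sparse-checkout set "$subdir"
  fi
}

# Example call with made-up names; uncomment and adjust for your setup:
# clone_subdir "https://github.com/example-org/example-repo.git" "notebooks" \
#   "/home/ec2-user/SageMaker/example-repo"
```

As with the S3 download, the script runs as root, so the cloned files may need a chown to ec2-user before you edit them from the notebook.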

andras
  • Mate, OMG thanks so much! Right, so it is possible to do those things. And yes, you're absolutely correct in your assumption: what I want is to download the content of an S3 bucket to the EBS volume of my instance OR clone a Git repository to the EBS volume of my notebook instance. But what about a private Git repository that requires an SSH key? It would also be great if we could specify in the LCC which S3 bucket is used to save the training/validation data, but again I don't know if that's possible. Thanks heaps! – blue2609 Jul 30 '21 at 01:22