8

I am trying to create an SKLearn Processing Job in Amazon SageMaker to perform some data transformations on my input data before model training.

I wrote a custom Python script, preprocessing.py, that performs the transformation. The script imports a few Python packages. Here is the SageMaker example I followed.
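For reference, this is roughly how I submit the job (the role, bucket and instance settings below are placeholders for my actual values):

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Submit preprocessing.py as a scikit-learn Processing Job
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,                      # SageMaker execution role (placeholder)
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)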

When I try to submit the Processing Job, I get this error:

............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'

I understand that my processing job cannot find this package and that I need to install it. My question is: how can I accomplish this through the SageMaker Processing Job API? Ideally there would be a way to specify a requirements.txt in the API call, but I don't see such functionality in the docs.

I know I can build a custom image with the relevant packages and then use that image in the Processing Job, but that seems like a lot of work for something that should be built in.

Is there an easier/more elegant way to install the packages needed in a SageMaker Processing Job?

iCHAIT

2 Answers

6

One way would be to call pip from Python:

import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
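For example, at the top of preprocessing.py, before the failing import (the Snowflake connector is published on PyPI as snowflake-connector-python):

import subprocess
import sys

# Install the missing dependency into the running container, then import it
subprocess.check_call([sys.executable, "-m", "pip", "install", "snowflake-connector-python"])

import snowflake.connector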

Another way would be to use an SKLearn Estimator (a training job) instead to do the same thing. You can provide a source_dir, which can include a requirements.txt file, and those requirements will be installed for you before your script runs:

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
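With that layout (foo.py and requirements.txt sitting next to each other inside ./foo), the requirements are installed before foo.py runs. Kicking the job off is then just (the channel name and S3 path are placeholders):

estimator.fit({"train": "s3://my-bucket/input/"})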
Neil McGuigan
0

Another thing you can do is use a bash script instead of a Python file as the entrypoint.

I defined the entrypoint in my Step Functions state machine (deployed via CloudFormation) like this:

...
MyProcessingJob:
  Type: Task
  Resource: arn:aws:states:::sagemaker:createProcessingJob.sync
  Parameters:
    AppSpecification:
      ContainerEntrypoint: ["bash", "/opt/ml/processing/input/code/start_process.sh"]
      ImageUri: "492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3"
...

The bash script could look something like:

# cd into folder
cd /opt/ml/processing/input/code

# install requirements
pip install -r requirements.txt

# start preprocessing
/miniconda3/bin/python -m entryscript --parameter value

This is what I used with the standard SKLearn Docker image, and it works great.
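If you are not going through Step Functions, the same trick can be launched from the SageMaker Python SDK with a ScriptProcessor, whose command parameter lets you run a bash script as the entrypoint. A sketch, assuming the role and instance settings are placeholders and that requirements.txt plus your Python entry script are shipped into the container separately (e.g. as additional ProcessingInputs):

from sagemaker.processing import ScriptProcessor

# Run the bash script as the processing entrypoint; it installs
# requirements.txt and then launches the actual preprocessing script
processor = ScriptProcessor(
    command=["bash"],
    image_uri="492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3",
    role=role,  # placeholder: your SageMaker execution role
    instance_count=1,
    instance_type="ml.m5.large",
)

processor.run(code="start_process.sh")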

Lukas Hestermeyer