15

I need to import functions from different Python scripts, which will be used inside the preprocessing.py file. I was not able to find a way to pass the dependent files to the SKLearnProcessor object, so I am getting a ModuleNotFoundError.

Code:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)


sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )

I would like to pass dependent_files = ['file1.py', 'file2.py', 'requirements.txt'], so that preprocessing.py has access to all the dependent modules.

I also need to install the libraries listed in the requirements.txt file.

Can you share a workaround, or the right way to do this?

Update (25-11-2021):

Q1. (Answered, but looking to solve it using FrameworkProcessor)

Here, the get_run_args function handles the dependencies, source_dir and code parameters by using FrameworkProcessor. Is there any way to set these parameters from ScriptProcessor, SKLearnProcessor, or any other Processor?

Q2.

Can you also please show some reference for using our Processor as a sagemaker.workflow.steps.ProcessingStep and then using it in a sagemaker.workflow.pipeline.Pipeline?

To have a Pipeline, is a SageMaker Project mandatory, or can we create a Pipeline directly without any SageMaker Project?

shaik moeed
  • have you tried to add extra `ProcessingInput` to download the source code? – Abdelrahman Maharek Sep 06 '21 at 22:41
  • @AbdelrahmanMaharek Yes, I tried that. For that we need to add source code to `s3` and pass that path to `ProcessingInput`. In this case, I'm not able to install dependent modules. – shaik moeed Sep 16 '21 at 05:25
  • @AbdelrahmanMaharek Can you please check the updated question? Is there any direction you can provide to resolve this? – shaik moeed Nov 24 '21 at 08:00
  • sorry, I've just seen this. I can see that you have managed to find a workaround by @tulio-casagrande – Abdelrahman Maharek Dec 02 '21 at 12:11
  • @AbdelrahmanMaharek Yes, using `ScriptProcessor` we can add custom `image_uri` which will be having libraries installed present in `requirements.txt`. – shaik moeed Dec 03 '21 at 05:07

2 Answers

23

There are a couple of options for you to accomplish that.

A really simple one is to add all the additional files to a folder, for example:

.
├── my_package
│   ├── file1.py
│   ├── file2.py
│   └── requirements.txt
└── preprocessing.py

Then send this entire folder as another input, under the same /opt/ml/processing/input/code/ directory, for example:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",  # <- this gets uploaded as /opt/ml/processing/input/code/preprocessing.py
    inputs=[
        ProcessingInput(source=input_data, destination='/opt/ml/processing/input'),
        # Send my_package as /opt/ml/processing/input/code/my_package/
        ProcessingInput(source='my_package/', destination="/opt/ml/processing/input/code/my_package/")
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

What happens is that the sagemaker-python-sdk uploads the script you pass as code="preprocessing.py" to /opt/ml/processing/input/code/, so my_package/ ends up in the same directory as the script.
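As a quick local sanity check (not SageMaker-specific; the `helper` function and file contents are made up for illustration), you can emulate this layout and confirm that the sibling-package import resolves:

```python
# Sketch: emulate the layout the Processing job sees, to verify that
# preprocessing.py can import from a sibling my_package/ directory.
# The temp dir stands in for /opt/ml/processing/input/code/.
import os
import subprocess
import sys
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "my_package"))

with open(os.path.join(root, "my_package", "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(root, "my_package", "file1.py"), "w") as f:
    f.write("def helper():\n    return 'imported ok'\n")
with open(os.path.join(root, "preprocessing.py"), "w") as f:
    f.write("from my_package.file1 import helper\nprint(helper())\n")

# Python puts the script's own directory on sys.path, so the import resolves.
out = subprocess.run(
    [sys.executable, os.path.join(root, "preprocessing.py")],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # → imported ok
```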

Edit:

For the requirements.txt, you can add this to your preprocessing.py:

import sys
import subprocess

# Install the job's dependencies at runtime; run this before importing
# anything that requirements.txt provides.
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "-r",
    "/opt/ml/processing/input/code/my_package/requirements.txt",
])
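You can exercise this pattern locally too (the path here is a stand-in for the one inside the container; the requirements file is left empty so pip succeeds without installing anything):

```python
# Local check of the install-at-runtime pattern: pip reads the
# requirements file and exits 0, and check_call raises on failure.
import os
import subprocess
import sys
import tempfile

req = os.path.join(tempfile.mkdtemp(), "requirements.txt")
with open(req, "w"):
    pass  # empty requirements file: nothing to install

rc = subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", req])
print("exit code:", rc)  # → exit code: 0
```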
Tulio Casagrande
  • Thanks for the detailed explanation. Can I assume that `input_data` as a folder in s3 bucket? I mean `input_data` contains multiple folders and data files that needed for `preprocessing.py` file and I will pass root folder will all the data as input i.e., `input_data`. – shaik moeed Nov 25 '21 at 04:46
  • Can you also please show some reference to use `.run(...` as `sagemaker.workflow.steps.ProcessingStep` and then use in `sagemaker.workflow.pipeline.Pipeline`? – shaik moeed Nov 25 '21 at 05:02
  • Exactly as you said, if your `input_data` is pointing to an s3 prefix (folder), such as `s3://my_bucket/data/`, all files from `data/` are sent to the job. [Here's a reference](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html) on how to use a `ProcessingStep` instead of manually calling `.run(...)`, most arguments are basically the same. – Tulio Casagrande Nov 26 '21 at 18:21
  • This is great. Is there a reference/documentation that states that the code goes to "/opt/ml/processing/input/code/"? Haven't been able to find it, only some vague hints in some examples. – oW_ Mar 15 '22 at 17:57
  • hey @oW_, not really a documentation, but you can check [here](https://github.com/aws/sagemaker-python-sdk/blob/7e2c7ab480b2b99ec6d580c1bf9e3be849c8db80/src/sagemaker/processing.py#L669-L692) in the SDK where the `code` argument is handled. – Tulio Casagrande Mar 15 '22 at 20:44
0

This isn't supported by SKLearnProcessor. You'd need to package your dependencies in a Docker image and create a custom Processor (e.g. a ScriptProcessor with the image_uri of the Docker image you created).
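For illustration, a minimal image for that approach could look like this (the base image and paths are assumptions; the requirements.txt gets baked in at build time rather than installed per job):

```dockerfile
# Sketch of a processing image with dependencies preinstalled.
FROM python:3.8-slim

COPY requirements.txt /opt/program/requirements.txt
RUN pip install --no-cache-dir -r /opt/program/requirements.txt

ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]
```

You would then push this image to ECR and pass its URI as the image_uri argument of ScriptProcessor, together with command=['python3'].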

shark8me
  • [Here](https://github.com/aws/sagemaker-python-sdk/blob/99f023e76a5db060907a796d4d8fee550f005844/src/sagemaker/processing.py#L1426), the `get_run_args` function, is handling `dependencies`, `source_dir` and `code` parameters. Is there any way that we can set this parameters from `ScriptProcessor` or any other `Processor` to set them? – shaik moeed Nov 24 '21 at 07:56