How to install additional packages in a SageMaker pipeline

Question

I want to add dependency packages to my SageMaker pipeline for use in the Preprocess step.

I tried adding them to required_packages in the setup.py file, but it's not working. It looks like setup.py is not used at all.

required_packages = ["sagemaker==2.93.0", "matplotlib"]

Preprocessing steps:

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/job-name",
    sagemaker_session=pipeline_session,
    role=role,
)
step_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
    arguments=["--input-data", input_data],
)
step_process = ProcessingStep(
    name="PreprocessSidData",
    step_args=step_args,
)

Pipeline definition:

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        processing_instance_count,
        training_instance_type,
        model_approval_status,
        input_data,
    ],
    steps=[step_process],
    sagemaker_session=pipeline_session,
)

Can you post the code you are using for the preprocess step and pipeline definition? — Giuseppe La Gualano, Nov 23 '22 at 17:30

Giuseppe La Gualano · Accepted Answer · 2022-11-23T19:27:48.477

For each job within the pipeline you should have separate requirements (so you install only the stuff you need in each step and have full control over it).

To do this, you need to use the source_dir parameter:

source_dir (str or PipelineVariable) – Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. Structure within this directory are preserved when training on Amazon SageMaker.

See the SageMaker Processing documentation for details. Note that SKLearnProcessor.run() does not accept source_dir, so you have to use the generic FrameworkProcessor instead.

Within the specified folder, there must be the script (in your case preprocess.py), any other files/modules that may be needed, and the requirements.txt file.

The structure of the folder then will be:

BASE_DIR/
|- requirements.txt
|- preprocess.py

This is an ordinary requirements.txt file, nothing special. The packages listed in it are installed automatically when the job's instance starts, with no extra instructions needed.
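Since a missing requirements.txt only shows up once the job is already running, it can help to check the folder locally before building the step. The following is a minimal sketch, not part of the SageMaker SDK; the helper name missing_step_files is hypothetical, and it assumes the layout shown above (entry script plus requirements.txt directly inside the folder).

```python
from pathlib import Path


def missing_step_files(source_dir, entry_point="preprocess.py"):
    """Return the expected files that are absent from a step's source folder.

    Hypothetical local helper: it only inspects the local folder and does
    not call SageMaker or reproduce how the SDK packages the directory.
    """
    root = Path(source_dir)
    expected = [entry_point, "requirements.txt"]
    return [name for name in expected if not (root / name).is_file()]
```

Calling missing_step_files(BASE_DIR) before sklearn_processor.run(...) and stopping if the returned list is not empty catches a misplaced file without waiting for a processing job to fail.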

So, your code becomes:

from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput


sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/job-name",
    sagemaker_session=pipeline_session,
    role=role
)

step_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="preprocess.py",
    source_dir=BASE_DIR,
    arguments=["--input-data", input_data],
)

step_process = ProcessingStep(
    name="PreprocessSidData",
    step_args=step_args
)

Note that I changed both the code parameter and the source_dir. It's a good practice to keep the folders for the various steps separate so you have a requirements.txt for each and don't create overlaps.
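In the comments, the asker reports that on sagemaker 2.93.0 the requirements.txt was not picked up from source_dir and that additionally passing it through the dependencies parameter of run() fixed it. The sketch below shows one way to switch between the two variants; build_run_kwargs and its flag name are hypothetical, and whether the fallback is needed on a given SDK version is an assumption to verify yourself.

```python
import os


def build_run_kwargs(base_dir, entry_point="preprocess.py",
                     ship_requirements_explicitly=False):
    """Assemble the code-related keyword arguments for FrameworkProcessor.run().

    Hypothetical helper. With the flag set, requirements.txt is also listed
    under `dependencies`, mirroring the workaround reported in the comments.
    """
    kwargs = {"code": entry_point, "source_dir": base_dir}
    if ship_requirements_explicitly:
        kwargs["dependencies"] = [os.path.join(base_dir, "requirements.txt")]
    return kwargs


# Intended use with the processor defined above (not executed here):
# step_args = sklearn_processor.run(
#     outputs=[...],
#     arguments=["--input-data", input_data],
#     **build_run_kwargs(BASE_DIR, ship_requirements_explicitly=True),
# )
```

Keeping the fallback behind a flag makes it easy to drop once source_dir alone is confirmed to work on the installed SDK version.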

It is showing an error: TypeError: run() got an unexpected keyword argument 'source_dir' — sid8491, Nov 23 '22 at 19:08
Absolutely true! I had forgotten that SKLearnProcessor does not accept this parameter, but the generic FrameworkProcessor (which can be configured to behave like SKLearnProcessor) does. I edited the answer and replaced the code with a tested, working version. — Giuseppe La Gualano, Nov 23 '22 at 19:29
I just checked this; it did not copy the requirements.txt along with the source code. Right now I am looking into the dependencies parameter of FrameworkProcessor. — sid8491, Nov 24 '22 at 05:42
It worked after adding `dependencies=[os.path.join(BASE_DIR, "requirements.txt")],`. Please add this to your answer. — sid8491, Nov 24 '22 at 07:03
What you describe is very strange. When source_dir is specified, the whole directory is copied into sourcedir.tar.gz, including every file inside it and therefore the requirements.txt. See this [official example](https://sagemaker.readthedocs.io/en/v2.18.0/frameworks/tensorflow/using_tf.html#use-third-party-libraries), which describes exactly this. I can add the dependencies parameter to the answer, but the approach I wrote should also work (I use it daily). — Giuseppe La Gualano, Nov 24 '22 at 09:53
That's what I was wondering as well. It should have copied everything, but it specifically omitted the requirements.txt and copied everything else. And yes, I have used it with training scripts, but it was not working the same way in the processing step. — sid8491, Nov 24 '22 at 11:07
The sagemaker version needs to be investigated at this point. I see from the original question that you do not have the latest (2.117.0) but you have 2.93.0. I wonder if it is something recently introduced. — Giuseppe La Gualano, Nov 24 '22 at 11:10
