You have to specify the source_dir. Within your script you can then import the modules as you normally do. From the documentation:
source_dir (str or PipelineVariable) – Path (absolute, relative or an
S3 URI) to a directory with any other training source code
dependencies aside from the entry point file (default: None). If
source_dir is an S3 URI, it must point to a tar.gz file. Structure
within this directory is preserved when training on Amazon SageMaker.
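If you do want to pass an S3 URI instead of a local path, you have to pack the folder into a tar.gz archive yourself first. A rough sketch (the bucket name, file names, and the use of S3Uploader here are purely illustrative, not part of the question):
import tarfile
from sagemaker.s3 import S3Uploader

# source_dir passed as an S3 URI must point to a tar.gz archive,
# with the entry point at the root of the archive
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("your_script_dir_path", arcname=".")

source_dir_uri = S3Uploader.upload(
    local_path="sourcedir.tar.gz",
    desired_s3_uri="s3://your-bucket/preprocess/source",
)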
In general, look at the documentation for Processing (note that you have to use FrameworkProcessor, not a framework-specific processor like SKLearnProcessor).
P.S.: The answer is similar to that of the question "How to install additional packages in sagemaker pipeline".
The specified folder must contain the entry-point script (in your case preprocess.py), any other files/modules that are needed, and optionally a requirements.txt file.
The folder structure will then be:
BASE_DIR/
|─ helper_functions/
| |─ your_utils.py
|─ requirements.txt
|─ preprocess.py
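The requirements.txt is a standard pip requirements file; FrameworkProcessor installs it in the processing container before the entry point runs (see the P.S. above). A hypothetical example, with package names chosen purely for illustration:
# requirements.txt - list whatever your preprocessing code needs
pandas==1.5.3
scikit-learn==1.0.2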
Within your preprocess.py, you can then import the helpers simply with:
from helper_functions.your_utils import your_class, your_func
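where helper_functions/your_utils.py is just an ordinary Python module. A minimal placeholder, mirroring the names in the import above (the actual contents are of course up to you):
# helper_functions/your_utils.py
import pandas as pd

class your_class:
    # placeholder helper class used by preprocess.py
    def __init__(self, threshold=0.5):
        self.threshold = threshold

def your_func(df: pd.DataFrame) -> pd.DataFrame:
    # placeholder helper: drop rows with missing values
    return df.dropna()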
So, your code becomes:
from sagemaker.processing import FrameworkProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep

BASE_DIR = your_script_dir_path

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=base_job_name,
    sagemaker_session=pipeline_session,
    role=role,
)

step_args = sklearn_processor.run(
    inputs=[your_inputs],
    outputs=[your_outputs],
    code="preprocess.py",
    source_dir=BASE_DIR,
    arguments=[your_arguments],
)

step_process = ProcessingStep(
    name="ProcessingName",
    step_args=step_args,
)
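From there the step can be wired into the pipeline as usual; a minimal sketch, assuming the same pipeline_session and role as above (the pipeline name is a placeholder):
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="YourPipelineName",
    steps=[step_process],
    sagemaker_session=pipeline_session,
)

# create or update the pipeline definition, then run it
pipeline.upsert(role_arn=role)
execution = pipeline.start()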
It's good practice to keep a separate source folder for each step, so they don't overlap.