I need to import functions from other Python scripts, which will be used inside the preprocessing.py file. I could not find a way to pass the dependent files to the SKLearnProcessor object, so I am getting a ModuleNotFoundError.
Code:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                          source=input_data,
                          destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                      )
I would like to pass dependent_files = ['file1.py', 'file2.py', 'requirements.txt'] so that preprocessing.py has access to all the dependent modules.
I also need to install the libraries listed in the requirements.txt file.
Can you share a workaround or the right way to do this?
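To make the goal concrete, here is a minimal sketch of one possible workaround that could sit at the top of preprocessing.py. It assumes the dependent files are uploaded with an extra ProcessingInput whose destination is /opt/ml/processing/deps; that path and the helper name bootstrap_dependencies are hypothetical and not part of the SageMaker API. The helper only uses the standard library: it optionally installs requirements.txt with pip and puts the directory on sys.path.

```python
import importlib
import subprocess
import sys
from pathlib import Path

# Hypothetical location: assumes file1.py, file2.py and requirements.txt were
# uploaded via an extra ProcessingInput with this destination.
DEPS_DIR = Path("/opt/ml/processing/deps")


def bootstrap_dependencies(deps_dir=DEPS_DIR, install=True):
    """Make modules in deps_dir importable; optionally install requirements.txt."""
    deps_dir = Path(deps_dir)
    requirements = deps_dir / "requirements.txt"
    if install and requirements.is_file():
        # Install into the interpreter that is running this script.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", str(requirements)]
        )
    if str(deps_dir) not in sys.path:
        sys.path.insert(0, str(deps_dir))
    # Ensure the import system notices files added after startup.
    importlib.invalidate_caches()
```

In preprocessing.py this would be called once, before `import file1` and `import file2`. Whether installing packages at job start is acceptable depends on the container having network access to a package index, which is an assumption here.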
Update-25-11-2021:
Q1. (Answered, but looking to solve it using FrameworkProcessor)
Here, the get_run_args function handles the dependencies, source_dir and code parameters by using FrameworkProcessor. Is there any way to set these parameters from ScriptProcessor, SKLearnProcessor or any other Processor?
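For reference, this is roughly what I understand the FrameworkProcessor route to look like. It is a sketch under assumptions, not a verified solution: the folder name src is hypothetical and is assumed to contain preprocessing.py, file1.py, file2.py and requirements.txt, and the function name is only for illustration. As far as I understand, a requirements.txt found in source_dir is installed before the entry point runs, but that should be checked against the SDK version in use.

```python
def run_with_framework_processor(role, input_data):
    """Sketch: run preprocessing.py together with its sibling modules."""
    # Imported lazily so this sketch can be loaded without the SageMaker SDK.
    from sagemaker.processing import (
        FrameworkProcessor,
        ProcessingInput,
        ProcessingOutput,
    )
    from sagemaker.sklearn.estimator import SKLearn

    processor = FrameworkProcessor(
        estimator_cls=SKLearn,
        framework_version="0.20.0",  # same version as in the question
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )
    processor.run(
        code="preprocessing.py",  # entry point, relative to source_dir
        source_dir="src",  # hypothetical folder with the dependent files
        inputs=[
            ProcessingInput(
                source=input_data,
                destination="/opt/ml/processing/input",
            )
        ],
        outputs=[
            ProcessingOutput(output_name="train_data",
                             source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test_data",
                             source="/opt/ml/processing/test"),
        ],
        arguments=["--train-test-split-ratio", "0.2"],
    )
    return processor
```

Calling `run_with_framework_processor(role, input_data)` starts a real processing job, so it needs valid AWS credentials and an execution role.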
Q2.
Can you also please share a reference for using our Processor as a sagemaker.workflow.steps.ProcessingStep and then using it in a sagemaker.workflow.pipeline.Pipeline?
To have a Pipeline, is a SageMaker project mandatory, or can we create a Pipeline directly without any SageMaker project?
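To show the shape I have in mind, here is a hedged sketch that wraps a processor in a ProcessingStep and registers it in a Pipeline. The step name, pipeline name and function name are hypothetical. Nothing in it references a SageMaker project. Newer SDK versions may prefer passing step_args instead of processor, so treat the constructor arguments as an assumption tied to the SDK version.

```python
def build_pipeline(processor, role, input_data):
    """Sketch: wrap a processor in a ProcessingStep and register a Pipeline."""
    # Imported lazily so this sketch can be loaded without the SageMaker SDK.
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.workflow.pipeline import Pipeline
    from sagemaker.workflow.steps import ProcessingStep

    step = ProcessingStep(
        name="PreprocessData",  # hypothetical step name
        processor=processor,
        code="preprocessing.py",
        inputs=[
            ProcessingInput(
                source=input_data,
                destination="/opt/ml/processing/input",
            )
        ],
        outputs=[
            ProcessingOutput(output_name="train_data",
                             source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test_data",
                             source="/opt/ml/processing/test"),
        ],
        job_arguments=["--train-test-split-ratio", "0.2"],
    )
    pipeline = Pipeline(name="PreprocessingPipeline", steps=[step])  # hypothetical name
    pipeline.upsert(role_arn=role)  # create or update the pipeline definition
    return pipeline
```

A run would then be started with `build_pipeline(sklearn_processor, role, input_data).start()`, which requires valid AWS credentials.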