My goal is to make a pipeline in Azure ML Studio that handles:
- an input_folder with audio files
- an output_folder with the desired output
- some parameters

The input/output folders are Azure ML connections to existing Azure Storage containers.
I have successfully created and tested a component that takes the same inputs/outputs as well as a model input. When run as a command job, it successfully accepts a dynamic input/output. The model, environment, and component are all registered with Azure ML.
But now I want to create a batch endpoint where I can submit different inputs/outputs. My approach is to define a pipeline using the existing component, but I run into trouble when specifying my model input as a default.
One challenge is that a batch endpoint apparently only allows uri_folder and literal inputs, but not custom_model, when invoking (see https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-batch-endpoints-jobs?view=azureml-api-2&tabs=sdk#data-inputs).
I have tried various options; below I elaborate on two approaches that I see as potential solutions, each with its own shortcoming. I just need the last piece to make either of them work.
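For context, the kind of dynamic invocation I am aiming for looks roughly like this (a sketch based on the linked doc; `ml_client`, the endpoint/deployment names, and the datastore paths are placeholders):

```python
from azure.ai.ml import Input, Output

# Sketch of the desired invocation: per-call storage locations plus literal
# parameters. Endpoint/deployment names and paths are placeholders.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="transcription-batch",  # placeholder
    deployment_name="default",            # placeholder
    inputs={
        "input_folder": Input(
            type="uri_folder",
            path="azureml://datastores/container_a/paths/audio/",
        ),
        "language": "nl",  # literal input
    },
    outputs={
        "output_folder": Output(
            type="uri_folder",
            path="azureml://datastores/container_a/paths/transcripts/",
        ),
    },
)
```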
**Using Pipeline.yaml**

Starting from an example YAML file, I was able to write out the pipeline. This works when registering it as a pipeline, but when invoking the batch endpoint, it won't accept an `azureml:mymodel:1` address. (I used https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-batch-pipeline-deployments?view=azureml-api-2&tabs=python as a base.)
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: pipebatch
display_name: batchpipe
description: **
inputs:
  input_folder:
    type: uri_folder
    optional: false
  model_folder:
    type: custom_model
    optional: false
  language:
    type: string
    default: nl
    optional: true
  compute_type:
    type: string
    default: int8
    optional: true
  vad_filter:
    type: boolean
    default: false
    optional: true
  beam_size:
    type: integer
    min: 1
    max: 20
    default: 5
    optional: true
  logging_level:
    type: string
    default: info
    optional: true
outputs:
  output_folder:
    type: uri_folder
  error_folder:
    type: uri_folder
jobs:
  fasterwhisper_job:
    type: command
    environment: azureml:environment@latest
    component: azureml:component@latest
    compute: azureml:mycluster
    inputs:
      input_folder: ${{parent.inputs.input_folder}}
      model_folder: ${{parent.inputs.model_folder}}
      language: ${{parent.inputs.language}}
      compute_type: ${{parent.inputs.compute_type}}
      vad_filter: ${{parent.inputs.vad_filter}}
      beam_size: ${{parent.inputs.beam_size}}
      logging_level: ${{parent.inputs.logging_level}}
    outputs:
      output_folder: ${{parent.outputs.output_folder}}
      error_folder: ${{parent.outputs.error_folder}}
```
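Registering this works, e.g. by loading the YAML and creating the component from Python (a sketch; the filename is a placeholder):

```python
from azure.ai.ml import load_component

# Load the pipeline definition from the YAML above and register it.
# "pipeline.yaml" is a placeholder filename.
pipeline_component = load_component(source="pipeline.yaml")
ml_client.components.create_or_update(pipeline_component)
```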
But if I want to replace the line `model_folder: ${{parent.inputs.model_folder}}` with a default input, it cannot accept something like `azureml:mymodel:1`. How can I define the pipeline with a given model in this case?
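Concretely, a default along these lines in the `inputs` section is what gets rejected (a hypothetical snippet illustrating the attempt, not working YAML):

```yaml
model_folder:
  type: custom_model
  optional: false
  default: azureml:mymodel:1   # rejected at registration/invocation
```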
**Define pipeline with Python**

The second approach was to use the `@pipeline` decorator and the build functionality:
```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

mycomponent = ml_client.components.get("registeredcomponent", label="latest")
mymodel = ml_client.models.get("registeredmodel", label="latest")

@pipeline()
def mypipeline(
    input_folder: Input(type=AssetTypes.URI_FOLDER),
    language: str,
    compute_type: str,
    vad_filter: bool,
    beam_size: int,
    logging_level: str,
    # output_folder: Output(type=AssetTypes.URI_FOLDER),
    # error_folder: Output(type=AssetTypes.URI_FOLDER),
):
    myoutput = mycomponent(
        input_folder=input_folder,
        # pin the registered model as a fixed input to the component
        model_folder=Input(type=AssetTypes.CUSTOM_MODEL, path=mymodel.id),
        language=language,
        compute_type=compute_type,
        vad_filter=vad_filter,
        beam_size=beam_size,
        logging_level=logging_level,
        # output_folder=output_folder,
        # error_folder=error_folder,
    )
    return {
        "output_folder": myoutput.outputs.output_folder,
        "error_folder": myoutput.outputs.error_folder,
    }

my_pipeline_built = mypipeline._pipeline_builder.build()
```
But in this case I cannot define my outputs as part of the pipeline function's parameters. If you uncomment the output/error folder parameters, registration throws an error about duplicates. But if you leave them commented out, deploying the pipeline fails with:

```
Invalid Pipeline Component mypipeline missing jobs or source job name to define pipeline.
```
Unfortunately, I cannot find an example that defines both inputs and outputs using the Python SDK v2 pipeline builder.
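For completeness, the registration and deployment steps that lead to the error follow the linked batch-pipeline-deployments doc, roughly (endpoint, deployment, and compute names are placeholders):

```python
from azure.ai.ml.entities import PipelineComponentBatchDeployment

# Register the built pipeline component, then attach it to an existing
# batch endpoint. Names below are placeholders.
registered = ml_client.components.create_or_update(my_pipeline_built)

deployment = PipelineComponentBatchDeployment(
    name="transcribe-dpl",
    endpoint_name="mybatchendpoint",
    component=registered,
    settings={"default_compute": "mycluster", "continue_on_step_failure": False},
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
```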
Using a model deployment is unfortunately not suitable for my scenario, as I need multiple inputs and outputs as well as literal inputs, whereas a model deployment only allows one input and one output.
I simply want a batch endpoint that accepts inputs and outputs dynamically, so that I can use the same pipeline with multiple different Azure Storage containers.