My goal is to make a pipeline in Azure ML Studio that handles:

  • an input_folder with audio files
  • an output_folder with the desired output
  • some parameters

The input/output folders are Azure ML connections to existing Azure Storage containers.

I have successfully created and tested a component that takes the same inputs/outputs as well as a model input. When run as a command job, it accepts dynamic inputs/outputs without problems. The model, environment, and component are all registered in Azure ML.
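
For context, this is roughly what the working standalone command job looks like. A minimal sketch, assuming ml_client is an authenticated MLClient; the script name, source folder, datastore, and paths are illustrative stand-ins for my actual setup:

from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes

# Hypothetical script/source folder; the real ones sit behind the registered component.
job = command(
    code="./src",
    command="python transcribe.py --input_folder ${{inputs.input_folder}} "
            "--model_folder ${{inputs.model_folder}} --language ${{inputs.language}} "
            "--output_folder ${{outputs.output_folder}}",
    environment="azureml:environment@latest",
    compute="mycluster",
    inputs={
        "input_folder": Input(type=AssetTypes.URI_FOLDER,
                              path="azureml://datastores/mydatastore/paths/audio-in"),
        "model_folder": Input(type=AssetTypes.CUSTOM_MODEL, path="azureml:mymodel:1"),
        "language": "nl",  # literal input
    },
    outputs={
        "output_folder": Output(type=AssetTypes.URI_FOLDER,
                                path="azureml://datastores/mydatastore/paths/audio-out"),
    },
)
ml_client.jobs.create_or_update(job)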

But now I want to create a batch endpoint where I can submit different inputs/outputs. My approach is to define a pipeline using the existing component, but I run into trouble when specifying my model input as a default.

One challenge is that a batch endpoint apparently only accepts uri_folder (and literal) inputs, but not custom_model, when invoking. (See https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-batch-endpoints-jobs?view=azureml-api-2&tabs=sdk#data-inputs)
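
Concretely, per the linked page, an invocation may only bind URI assets and literals. A sketch, with a hypothetical endpoint name:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

job = ml_client.batch_endpoints.invoke(
    endpoint_name="myendpoint",  # hypothetical
    inputs={
        "input_folder": Input(type=AssetTypes.URI_FOLDER,
                              path="azureml://datastores/mydatastore/paths/audio-in"),
        "language": Input(type="string", default="nl"),  # literal input
        # "model_folder": Input(type=AssetTypes.CUSTOM_MODEL, path="azureml:mymodel:1"),
        # ^ a custom_model input like this is what invoke will not accept
    },
)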

I have tried various options; below I elaborate on two approaches that I see as potential solutions, each with its own shortcoming. I just need that last piece to make one of them work.

Using Pipeline.yaml

Using an example YAML file, I was able to write out the pipeline. This works when registered as a pipeline, but when invoking the batch endpoint, it won't accept an azureml:mymodel:1 address. (I used https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-batch-pipeline-deployments?view=azureml-api-2&tabs=python as a base.)

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: pipebatch
display_name: batchpipe
description: **

inputs:
  input_folder:
    type: uri_folder
    optional: false
  model_folder:
    type: custom_model
    optional: false
  language:
    type: string
    default: nl
    optional: true
  compute_type:
    type: string
    default: int8
    optional: true
  vad_filter:
    type: boolean
    default: false
    optional: true
  beam_size:
    type: integer
    min: 1
    max: 20
    default: 5
    optional: true
  logging_level:
    type: string
    default: info
    optional: true
    
outputs:
  output_folder:
    type: uri_folder
  error_folder:
    type: uri_folder

jobs:
  fasterwhisper_job:
    type: command
    environment: azureml:environment@latest
    component: azureml:component@latest
    compute: azureml:mycluster
    inputs:
      input_folder: ${{parent.inputs.input_folder}}
      model_folder: ${{parent.inputs.model_folder}}
      language: ${{parent.inputs.language}}
      compute_type: ${{parent.inputs.compute_type}}
      vad_filter: ${{parent.inputs.vad_filter}}
      beam_size: ${{parent.inputs.beam_size}}
      logging_level: ${{parent.inputs.logging_level}}
      
    outputs:
      output_folder: ${{parent.outputs.output_folder}}
      error_folder: ${{parent.outputs.error_folder}}

But if I want to replace the line model_folder: ${{parent.inputs.model_folder}} with a default value, it will not accept something like azureml:mymodel:1. How can I define the pipeline with a fixed model in this case?
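
For reference, I register and deploy this YAML following the pattern from the linked how-to. A minimal sketch, assuming the file is saved as pipeline.yaml and with illustrative endpoint/deployment names:

from azure.ai.ml import load_component
from azure.ai.ml.entities import PipelineComponentBatchDeployment

# Register the pipeline component, then deploy it behind the batch endpoint.
pipeline_component = ml_client.components.create_or_update(
    load_component(source="pipeline.yaml")
)
deployment = PipelineComponentBatchDeployment(
    name="batchpipe-dpl",
    endpoint_name="myendpoint",
    component=pipeline_component,
    settings={"default_compute": "mycluster"},
)
ml_client.batch_deployments.begin_create_or_update(deployment)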

Define pipeline with Python

The second approach was to use the @pipeline decorator and the builder functionality:

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

# ml_client is an authenticated MLClient
mycomponent = ml_client.components.get("registeredcomponent", label="latest")
mymodel = ml_client.models.get("registeredmodel", label="latest")

@pipeline()
def mypipeline(
    input_folder: Input(type=AssetTypes.URI_FOLDER), 
    language: str,
    compute_type: str,
    vad_filter: bool,
    beam_size: int,
    logging_level: str,
    # output_folder: Output(type=AssetTypes.URI_FOLDER),
    # error_folder: Output(type=AssetTypes.URI_FOLDER),
):

    myoutput = mycomponent(
        input_folder=input_folder,
        model_folder=Input(type=AssetTypes.CUSTOM_MODEL, path=mymodel.id),
        language=language,
        compute_type=compute_type,
        vad_filter=vad_filter,
        beam_size=beam_size,
        logging_level=logging_level,
        # output_folder=output_folder,
        # error_folder=error_folder
    )

    return {"output_folder": myoutput.outputs.output_folder, "error_folder": myoutput.outputs.error_folder}

my_pipeline_built = mypipeline._pipeline_builder.build()
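
For completeness, the registration step where the error described below occurs is simply:

pipeline_component = ml_client.components.create_or_update(my_pipeline_built)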

But in this case I cannot declare my outputs as part of the pipeline signature. If you uncomment the output/error folder parameters, registration throws an error about duplicates; if you leave them commented out, deploying the pipeline fails with: Invalid Pipeline Component mypipeline missing jobs or source job name to define pipeline. Unfortunately, I cannot find an example that declares both inputs and outputs using the Python SDK v2 pipeline builder.

Using a model deployment is unfortunately not suitable for my scenario: I need multiple inputs and outputs as well as literal inputs, whereas a model deployment only allows one input and one output.

I simply want a batch endpoint that accepts inputs and outputs dynamically, because I want to use the same pipeline with multiple different Azure Storage containers.
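
In other words, each invocation should be able to look roughly like this, with inputs and outputs pointing at different containers. Datastore names and paths are illustrative, and I am assuming uri_folder outputs can be redirected this way:

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes

job = ml_client.batch_endpoints.invoke(
    endpoint_name="myendpoint",
    deployment_name="batchpipe-dpl",
    inputs={
        "input_folder": Input(type=AssetTypes.URI_FOLDER,
                              path="azureml://datastores/container_a/paths/audio-in"),
    },
    outputs={
        "output_folder": Output(type=AssetTypes.URI_FOLDER,
                                path="azureml://datastores/container_b/paths/transcripts"),
        "error_folder": Output(type=AssetTypes.URI_FOLDER,
                               path="azureml://datastores/container_b/paths/errors"),
    },
)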
