
I've been following this tutorial by Azure on how to create an end-to-end pipeline in Azure ML: https://github.com/Azure/azureml-examples/blob/main/tutorials/e2e-ds-experience/e2e-ml-workflow.ipynb

I am doing something similar, except that instead of making credit predictions, I create word embeddings. The problem is that while I can get the output and create a data asset from the resulting CSV manually through the UI, I would like that registration to happen as a step of the pipeline itself. I've searched a lot, but have not found a reliable way to make this work automatically.
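For reference, the manual step I want to automate corresponds roughly to the following in the v2 SDK (a sketch, assuming an authenticated ml_client; the asset name and the datastore path are placeholders I made up):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# What I currently do by hand in the UI, expressed in code:
# register the CSV the pipeline produced as a data asset.
embeddings_asset = Data(
    name="embeddings",  # placeholder asset name
    version="1",
    type=AssetTypes.URI_FILE,
    # placeholder path -- in practice this points at the job's output folder
    path="azureml://datastores/workspaceblobstore/paths/<output-folder>/embeddings.csv",
    description="Document embeddings produced by the pipeline",
)
ml_client.data.create_or_update(embeddings_asset)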

My code is mostly the same as in the tutorial, with the big exception being the training job:

from azure.ai.ml import command, Input, Output

fetch_model_component = command(
    name="fetch_pre_trained_model_and_create_embeddings",
    display_name="Fetch Pre-Trained Model and create embeddings",
    description="fetches a pre-trained sbert model, and uses text to create document embeddings",
    inputs={
        "data": Input(type="uri_folder"),
        "registered_model_name": Input(type="string"),
    },
    outputs={
        "model": Output(type="uri_folder", mode="rw_mount"),
        # the path= is an attempt to register the output as a data asset (see below)
        "embeddings": Output(type="uri_folder", mode="rw_mount", path="azureml:embeddings:1"),
    },
    # The source folder of the component
    code=train_src_dir,
    # every intermediate line of the command needs a trailing backslash
    command="""python pre_trained.py \
            --data ${{inputs.data}} --registered_model_name ${{inputs.registered_model_name}} \
            --model ${{outputs.model}} --embeddings ${{outputs.embeddings}}
            """,
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)

The path for the embeddings output is just something I tried that was mentioned on this page: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?tabs=python
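If I read that page correctly, the way to register a job output as a data asset may be to set name and version on the Output at the pipeline level, rather than a path on the component output. This is only a sketch of my understanding, not verified; embeddings_pipeline, web_path, and registered_model_name stand in for my actual pipeline definition:

from azure.ai.ml import Input, Output

pipeline_job = embeddings_pipeline(
    pipeline_job_data_input=Input(type="uri_folder", path=web_path),
    pipeline_job_registered_model_name=registered_model_name,
)

# Ask Azure ML to register this pipeline output as a data asset
# named "embeddings", version "1", when the job completes.
pipeline_job.outputs.embeddings = Output(
    type="uri_folder",
    mode="rw_mount",
    name="embeddings",
    version="1",
)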

Here is how the CSV is stored in pre_trained.py:

df.to_csv(os.path.join(args.embeddings, "embeddings.csv"), index=False)
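For context, args.embeddings is just the mounted output folder that Azure ML hands to the script. Roughly (trimmed, with the rest of pre_trained.py omitted):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str)
parser.add_argument("--registered_model_name", type=str)
parser.add_argument("--model", type=str)
# Azure ML substitutes ${{outputs.embeddings}} with a writable mount path
parser.add_argument("--embeddings", type=str)
args = parser.parse_args()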

Any pointers would be greatly appreciated, as this is somewhat of a last resort before scrapping the idea.

