
I'm trying to write the output of batch scoring into the data lake:

    parallel_step_name = "batchscoring-" + datetime.now().strftime("%Y%m%d%H%M")
    
    output_dir = PipelineData(name="scores", 
                              datastore=def_ADL_store,
                              output_mode="upload",
                              output_path_on_compute="path in data lake")

    parallel_run_config = ParallelRunConfig(
        environment=curated_environment,
        entry_script="use_model.py",
        source_directory="./",
        output_action="append_row",
        mini_batch_size="20",
        error_threshold=1,
        compute_target=compute_target,
        process_count_per_node=2,
        node_count=2
    )
    
    batch_score_step = ParallelRunStep(
        name=parallel_step_name,
        inputs=[test_data.as_named_input("test_data")],
        output=output_dir,
        parallel_run_config=parallel_run_config,
        allow_reuse=False
    )

However, I get the error: "code": "UserError", "message": "User program failed with Exception: Missing argument --output or its value is empty."

How can I write results of batch score to data lake?

chenxu

1 Answer


I don't think ADLS is supported as the datastore behind a PipelineData. My suggestion is to have the ParallelRunStep write its PipelineData to the workspace's default blob store, then add a DataTransferStep that copies the results to ADLS after the ParallelRunStep completes.
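
A minimal sketch of that pattern (the ADLS datastore name, Data Factory compute name, and output path below are placeholders, not values from the question):

    from azureml.core import Datastore, Workspace
    from azureml.core.compute import DataFactoryCompute
    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import DataTransferStep

    ws = Workspace.from_config()

    # Workspace default blob store -- supported as a PipelineData datastore
    def_blob_store = ws.get_default_datastore()

    # Previously registered ADLS datastore and Data Factory compute target
    # ("def_ADL_store" and "adf-compute" are placeholder names)
    def_ADL_store = Datastore.get(ws, "def_ADL_store")
    data_factory_compute = DataFactoryCompute(ws, "adf-compute")

    # 1) Let the ParallelRunStep write its scores to the default blob store
    output_dir = PipelineData(name="scores", datastore=def_blob_store)

    # 2) Reference the destination folder in the ADLS datastore
    adls_data_ref = DataReference(
        datastore=def_ADL_store,
        data_reference_name="adls_scores",
        path_on_datastore="path/in/data/lake",  # placeholder path
    )

    # 3) Copy blob -> ADLS once scoring has completed
    transfer_blob_to_adls = DataTransferStep(
        name="transfer_blob_to_adls",
        source_data_reference=output_dir,
        destination_data_reference=adls_data_ref,
        compute_target=data_factory_compute,
    )

Because the `DataTransferStep` consumes `output_dir`, the SDK orders it after the `ParallelRunStep` automatically when both steps are added to the same `Pipeline`.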

Anders Swanson
  • output_dir = PipelineData(name="scores", datastore=datastore, ) adls_data_ref = DataReference( datastore=def_blob_store, data_reference_name="adls_test_data", path_on_datastore="Private/Opportunities/ModelData/OneModel") transfer_blob_to_adls = DataTransferStep( name="transfer_blob_to_adls", source_data_reference=output_dir, destination_data_reference=adls_data_ref, compute_target=data_factory_compute) But I met the error: unexpected error: Blob contains both folder and file with same name – chenxu Aug 07 '20 at 07:49
  • can you share your new `PipelineData` definition for `output_dir`? – Anders Swanson Aug 07 '20 at 16:40
  • you're very close now. here's [a thread](https://social.microsoft.com/Forums/mvpforum/fr-FR/026b9b1d-6961-4217-b179-0c1973ac1fa2/data-transfer-job-failed-with-unexpected-error-systeminvalidoperationexception-blob-contains-both?forum=AzureMachineLearningService) I found with the error message you have – Anders Swanson Aug 07 '20 at 16:41
  • Thank you. I added `source_reference_type="directory"` and it succeeded (see the sketch after this thread). – chenxu Aug 10 '20 at 02:38
  • Awesome!!! Our team really loves Azure ML Pipelines; I'd love to see what you develop, especially with `ParallelRunStep` – Anders Swanson Aug 10 '20 at 02:40
  • Thanks @AndersSwanson, this answer was a life saver! And thanks to chenxu as well, your "add source_reference_type: directory" solution really brought it all home! – yeamusic21 Dec 21 '20 at 21:02
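
For reference, here is a sketch of the final working transfer step described in the comments above. The only change from the answer's sketch is `source_reference_type="directory"`, which tells the step that the `PipelineData` source is a folder rather than a single blob (same placeholder names as before):

    transfer_blob_to_adls = DataTransferStep(
        name="transfer_blob_to_adls",
        source_data_reference=output_dir,
        destination_data_reference=adls_data_ref,
        compute_target=data_factory_compute,
        source_reference_type="directory",  # the source is a folder of score files
    )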