0

I've written a webscraping script to extract a web table in a dataframe format which I have converted into a TabularDataset using Dataset.Tabular.register_pandas_dataframe() to store in the default datastore.

I want to pass this webscraped table as a side_input to ParallelRunStep() in a batch inferencing pipeline but in order to achieve that, the side_input should be in FileDataset type.

The established way that has worked so far to convert TabularDataset into FileDataset was using

side_input = Dataset.File.from_files(path="/path/to/file/on/datastore") 

The above method worked for uploaded csv files in the datastore. But thing with registering a pandas dataframe in the datastore is that with each run of the webscraping script occurs a registration of pandas dataframe in the datastore to convert it into a TabularDataset and the relative path in the datastore changes.

Hardcoding the relative path works but since the web table data changes periodically, I want the latest data from the web table to be used aas a side_input

My questions:

  • how to convert pandas dataframe into FileDataset eg: to_FileDataset if it even exists?
  • how to convert TabularDatset into FileDataset?
  • if there is any way to find the relative path of the registered tabular dataset in the datastore using azureml python sdk v2?

side note: the metadata of tabular dataset shows a json info something like this

>>> tabular_ds
{
  "source": [("default_datastore", "relative/path/to/the/tabular/dataset")],
   .
   .
   .
}

I was wondering if I can extract the source key from this like I can extract the ws.name after initializing ws = Workspace.from_config() . Just thinking out loud.

1 Answers1

0

One possible solution to convert a TabularDataset into a FileDataset, is by using to_csv_files method which returns a FileDataset object from TabularDataset object.

enter image description here

This will support all the relevant methods of FileDataset Object.

RishabhM
  • 525
  • 1
  • 5