2

I have a machine learning model deployed in Azure Designer Studio. I need to retrain it every day with new data through Python code. I need to keep the existing CSV data in blob storage and also add some more data to that CSV before retraining. If I retrain the model with only the new data, the old data is lost, so I need to retrain the model by appending the new data to the existing data. Is there any way to do this through Python code?

I have also researched append blobs, but they only add data at the end of the blob. The documentation mentions that we cannot update or add to an existing blob.

desertnaut
Rakesh RG
  • Are you using block blobs or append blobs? – Gaurav Mantri Mar 11 '21 at 04:52
  • I created a resource group and a workspace in the Azure portal. It automatically created a storage account, and I added a CSV file to it. It's the default block blob storage. We could also use an append blob, but I need to add new data to the same CSV file. Is there any way to do that? – Rakesh RG Mar 11 '21 at 04:58
  • Why does it have to be one CSV? why not one CSV for every day? – Anders Swanson Mar 11 '21 at 05:21

1 Answer

2

I'm not sure why it has to be one CSV file. There are many Python-based libraries for working with a dataset spread across multiple CSVs.

In all of these examples, you pass a glob pattern that will match multiple files. This pattern works very naturally with Azure ML Datasets, which you can use as your input. See this excerpt from the docs link above.

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()
    
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')] # here's the glob pattern

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
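On the comment above about one CSV per day: here's a minimal sketch of that approach, assuming a blob-backed datastore and using `upload_files` from azureml-core. The daily file name and the `weather/2019/` folder are placeholders chosen to line up with the glob pattern above.

import pandas as pd
from azureml.core import Workspace, Datastore

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'your datastore name')

# placeholder rows; in practice this would be the data collected that day
new_rows = pd.DataFrame({'date': ['2019-03-11'], 'temperature': [21.5]})
new_rows.to_csv('new_rows_2019-03-11.csv', index=False)

# drop the daily file into the folder matched by the glob pattern,
# so the TabularDataset above picks it up on the next retraining run
datastore.upload_files(files=['new_rows_2019-03-11.csv'],
                       target_path='weather/2019/',
                       overwrite=True)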

Assuming that all the CSVs can fit into memory, you can easily turn these datasets into pandas DataFrames. With Azure ML Datasets, you call

# get the input dataset by name from the workspace defined above
dataset = Dataset.get_by_name(workspace, name='your dataset name')
# load the TabularDataset into a pandas DataFrame
df = dataset.to_pandas_dataframe()

With a Dask DataFrame, this GitHub issue says you can call

df = my_dask_df.compute()
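For context, a minimal sketch of where `my_dask_df` could come from, assuming the CSVs are reachable on a local or mounted path; `data/weather/2019/*.csv` is a placeholder.

import dask.dataframe as dd

# lazily read every CSV that matches the glob pattern
my_dask_df = dd.read_csv('data/weather/2019/*.csv')
# materialize everything as a single pandas DataFrame
df = my_dask_df.compute()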

As for output datasets, you can control this by reading the existing output CSV in as a DataFrame, appending the new data to it, and then overwriting it at the same location.
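For example, a minimal sketch of that read-append-overwrite step, assuming the CSV lives in the workspace's blob datastore; the `training/` folder, the file name, the column names, and the new rows are all placeholders.

import pandas as pd
from azureml.core import Workspace, Datastore, Dataset

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'your datastore name')

# read the CSV currently sitting in blob storage as a DataFrame
existing = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'training/prediction_data01.csv')).to_pandas_dataframe()

# placeholder new rows; in practice these come from the day's new data
new_rows = pd.DataFrame({'CUSTOMERCODE': ['C001'], 'DESCRIPTION': ['example text']})

# append and write back to the same blob path, overwriting the old file
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_csv('prediction_data01.csv', index=False)
datastore.upload_files(files=['prediction_data01.csv'],
                       target_path='training/',
                       overwrite=True)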

Anders Swanson
  • I'm doing some operations after converting the CSV to a pandas DataFrame while training the model. Will I still be able to do that with a dataset and pandas? Code for your reference below: `df = pd.read_csv('prediction_data01.csv')` `df = df[pd.notnull(df['DESCRIPTION'])]` `df = df[pd.notnull(df['CUSTOMERCODE'])]` `col = ['CUSTOMERCODE', 'DESCRIPTION']` `df = df[col]` `df.columns = ['CUSTOMERCODE', 'DESCRIPTION']` `df['category_id'] = df['DESCRIPTION'].factorize()[0]` – Rakesh RG Mar 11 '21 at 07:05
  • I'm also using TfidfVectorizer, CountVectorizer, train_test_split, TfidfTransformer and fit methods – Rakesh RG Mar 11 '21 at 07:19
  • Assuming that all the CSVs can fit into memory, this is rather simple. With Azure ML Datasets, you call `df = dataset.to_pandas_dataframe()`. With a Dask DataFrame, you can call `df = my_dask_df.compute()` – Anders Swanson Mar 11 '21 at 07:28
  • Thank you so much Anders. I'll try it out and let you know if I need any help – Rakesh RG Mar 11 '21 at 07:32
  • cool just editing my original answer w/ more info as well – Anders Swanson Mar 11 '21 at 07:33