
I would like to set up MLflow with the following components:

  • Backend store (local): using a SQLite database locally to store MLflow entities (run_id, params, metrics, ...)
  • Artifact store (remote): using a blob container in my Azure Data Lake Storage Gen2 account to store the output files (versioned datasets, serialized models, images, ...) related to my model
  • Tracking server: started with something that looks like this command


```
mlflow server --backend-store-uri sqlite:///C:\sqlite\db\mlruns.db --default-artifact-root wasbs://container-name@storage_account_name.blob.core.windows.net/mlartifacts -h 0.0.0.0 -p 8000
```

Where mlruns.db is a database that I created in SQLite (inside a db folder) and mlartifacts is the folder I created inside the blob container to receive all the output files.
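
For reference, this is roughly how a client script would talk to that server once it is up (a minimal sketch; the experiment name and artifact path are placeholders I chose for illustration):

```
import mlflow

# Point the client at the tracking server started above
# (host/port must match the -h/-p flags of `mlflow server`).
mlflow.set_tracking_uri("http://localhost:8000")
mlflow.set_experiment("demo-experiment")  # placeholder name

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)
    # This file should end up under the wasbs:// artifact root.
    mlflow.log_artifact("outputs/model.pkl")  # placeholder path
```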

I run this command and then do an mlflow run (or a kedro run, as I'm using Kedro), but almost nothing happens. The database is populated with 12 tables, but they are all empty, and nothing happens inside the data lake.

What I want should look like Scenario 4 in the documentation.

For the artifact store, I couldn't find detailed instructions. I tried to look at MLflow's documentation here, but it is not very helpful (I'm still a beginner). They say that:

MLflow expects Azure Storage access credentials in the AZURE_STORAGE_CONNECTION_STRING, AZURE_STORAGE_ACCESS_KEY environment variables or having your credentials configured such that the DefaultAzureCredential() class can pick them up.

However, even after adding the env variables, nothing seems to be stored in the data lake. I created the two env variables (on Windows 10) as follows; a quick sanity check of the connection string is sketched after the list:

  • AZURE_STORAGE_ACCESS_KEY = wasbs://container-name@storage_account_name.blob.core.windows.net/mlartifacts

  • AZURE_STORAGE_CONNECTION_STRING = DefaultEndpointsProtocol=https;AccountName=storagesample;AccountKey=. I got it by following this path on the Azure Portal: Storage account / Access keys / Connection string (I took the one for key 2).
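
To verify that a connection string of that shape actually authenticates (independently of MLflow), a small azure-storage-blob check like the sketch below can be used; the container and folder names match my setup above, and the connection string is a placeholder:

```
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; paste the real one from the portal.
conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=storage_account_name;"
    "AccountKey=...;EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("container-name")

# List whatever already sits under the mlartifacts/ folder.
for blob in container.list_blobs(name_starts_with="mlartifacts/"):
    print(blob.name)
```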

They also state that:

Also, you must run pip install azure-storage-blob separately (on both your client and the server) to access Azure Blob Storage. Finally, if you want to use DefaultAzureCredential, you must pip install azure-identity; MLflow does not declare a dependency on these packages by default.

I added them to my project requirements, but what do they mean exactly by installing on both the client and the server? And how does azure-identity help in the setup?

Could you please help me with step-by-step instructions on how to do the complete setup?

Thank you in advance!

Downforu
  • Not exactly what you are asking for, but would using Azure ML as the MLflow server be an alternative solution? That also uses Storage Blob for artifacts and removes the need to set up a SQL server. The MLflow client is also simpler to set up. – Matthieu Maitre Nov 26 '21 at 16:40
  • Thank you for your reply. Actually, my first intention was to do exactly what you propose. I even posted on that matter before I posted this one. See https://stackoverflow.com/questions/70010405/run-experiments-on-azure-ml-with-kedro-and-mlflow. I haven't tried to make it work since then, but I'm open to any suggestion if you have any tips on how to use Kedro together with MLflow and Azure ML as a tracking server. – Downforu Nov 29 '21 at 11:53

2 Answers


You just need to set AZURE_STORAGE_CONNECTION_STRING; AZURE_STORAGE_ACCESS_KEY is optional if the first environment variable is used (and in any case, AZURE_STORAGE_ACCESS_KEY shouldn't be a URL, but the actual access key).

Regarding the azure-storage-blob package: it should be installed both on the server where you run mlflow server and on the machine where you run your training (the client).
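
To make the client side concrete, here is a minimal sketch, assuming the server was started with the same variable already present in its environment (the connection string and file path are placeholders):

```
import os
import mlflow

# The connection string must be visible to BOTH processes: the one
# running `mlflow server` and this training process. Setting it here
# only covers the client side (the value is a placeholder).
os.environ["AZURE_STORAGE_CONNECTION_STRING"] = (
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"
    "EndpointSuffix=core.windows.net"
)

mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    # With a wasbs:// artifact root, this upload goes directly from
    # the client to Azure Blob Storage, which is why azure-storage-blob
    # is needed on the client as well.
    mlflow.log_artifact("outputs/model.pkl")  # placeholder path
```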

Alex Ott
  • Thank you for your answer! You're right about AZURE_STORAGE_ACCESS_KEY; I don't know why I made this mistake. Ultimately, it was not taken into account in my case since I used AZURE_STORAGE_CONNECTION_STRING. I finally managed to make it work by using double backslashes and port 5000 ==> ```mlflow server --backend-store-uri sqlite:///C:\\sqlite\\db\\mlruns.db --default-artifact-root wasbs://container-name@storage_account_name.blob.core.windows.net/mlartifacts -h 0.0.0.0 -p 5000``` – Downforu Nov 29 '21 at 11:50
  • Is it possible to set the path of the sqlite db to an s3 location instead of a local C: location? – asanoop24 Mar 15 '22 at 19:55
  • Not by default; only if you have some fuse mount or something like that – Alex Ott Mar 15 '22 at 20:03

I added this answer as it could be useful for others (I had to do a lot of searching for a method to save MLflow artifacts in blob storage using a Service Principal).

In many cases it is required (and recommended) to use a Service Principal instead of AZURE_STORAGE_ACCESS_KEY. For this, the ClientSecretCredential can be used, which requires the environment variables AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET to be provided. For details, refer to azure ClientSecretCredential.

MLflow's implementation uses DefaultAzureCredential, which will automatically pick up those environment variables and set this up. For details, refer to this pull request (Enable AzureDefaultCredential for authenticating against Azure Storage backends).

Remember to install the azure-identity and azure-storage-blob dependencies if they are not already installed.
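
As a quick way to confirm that the Service Principal variables are actually picked up before involving MLflow, a sketch like the following can be used (all three values are placeholders):

```
import os
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential reads these through its EnvironmentCredential
# step; all three values are placeholders for your Service Principal.
os.environ["AZURE_TENANT_ID"] = "<tenant-id>"
os.environ["AZURE_CLIENT_ID"] = "<client-id>"
os.environ["AZURE_CLIENT_SECRET"] = "<client-secret>"

# If this call fails, MLflow will fail with the same credential error
# shown below.
credential = DefaultAzureCredential()
token = credential.get_token("https://storage.azure.com/.default")
print("Token acquired; expires at:", token.expires_on)
```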

If you get an error like

DefaultAzureCredential failed to retrieve a token from the included credentials.

check that the environment variables are correctly set.

Kumar Saurabh