
I have several files (11) already registered as datasets (mltable) in Azure ML Studio. Loading them into DataFrames works in all cases except one. I believe the reason is the size: 1.95 GB. How can I load this dataset into a DataFrame? So far I have not managed to load it at all.

Any tips on how to do it efficiently? I tried to figure out a way to do it in parallel with modin but failed. Below is the load script.


from azureml.core import Dataset, Workspace

subscription_id = 'xyz'
resource_group = 'rg-personal'
workspace_name = 'test'

# Connect to the workspace and fetch the registered dataset
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='buses')

# Materialize the dataset into a pandas DataFrame (this is the step that fails)
df = dataset.to_pandas_dataframe()
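
A minimal sketch, assuming 'buses' is a TabularDataset (which the to_pandas_dataframe() call implies): loading only the first rows checks whether the failure is purely size-related.

# Hedged sketch: load a small sample to check that everything except the size works.
sample_df = dataset.take(1000).to_pandas_dataframe()  # first 1000 rows only
print(sample_df.shape)
print(sample_df.dtypes)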
Przemek
  • Does this answer your question? [How to load large data into pandas efficiently?](https://stackoverflow.com/questions/48989597/how-to-load-large-data-into-pandas-efficiently) – Ecstasy Jun 10 '22 at 04:36
  • Hi! It did not help but inspired me to load it differently. I already posted the solution. Anyways, thanks! – Przemek Jun 12 '22 at 18:56

2 Answers


You can load the data using an AzureML long-form datastore URI directly into Pandas.

Ensure you have the azureml-fsspec Python library installed:

pip install azureml-fsspec

Next, just load the data:

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()

As this uses the AzureML datastore, it will automatically handle authentication for you without exposing SAS keys in the URI. Authentication can be either identity-based (i.e. your AAD identity is passed through to storage) or credential-based.

AzureML datastore URIs are an implementation of the Filesystem Spec (fsspec): a unified Pythonic interface to local, remote, and embedded file systems and byte storage.
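
As a sketch (the subscription, workspace, and datastore names are placeholders), the same datastore can be browsed through that fsspec interface:

# Minimal sketch using the filesystem exposed by azureml-fsspec; all URI parts are placeholders.
from azureml.fsspec import AzureMachineLearningFileSystem

fs = AzureMachineLearningFileSystem(
    "azureml://subscriptions/<subid>/resourcegroups/<rgname>/"
    "workspaces/<workspace_name>/datastores/<datastore_name>"
)
print(fs.ls())  # list the folders/files at the root of the datastore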

This implementation leverages the AzureML data runtime: a fast and efficient engine to materialize the data into a Pandas or Spark dataframe. The engine is written in Rust, which is known for high speed and high memory efficiency for data processing tasks.
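
For a file as large as 1.95 GB, a minimal sketch (assuming the file is a CSV reachable through the same azureml:// URI; the chunk size is only an illustrative value) is to stream it in chunks so the whole file is never parsed into memory at once:

import pandas as pd

uri = ("azureml://subscriptions/<subid>/resourcegroups/<rgname>/"
       "workspaces/<workspace_name>/datastores/<datastore_name>/"
       "paths/<folder>/<filename>.csv")

# Read the CSV 500k rows at a time; either concatenate or process each chunk and discard it.
chunks = pd.read_csv(uri, chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)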

Sam Kemp
  • Sounds like a great feature! Do you know if it works if the workspace is on a private endpoint / VNet? I tried this and got the error below (timeout). Thanks! `Unexpected failure while resolving environment for Datastore 'xx' in subscription: 'yy', resource group: 'aa', workspace: 'bb'.` `HTTPSConnectionPool(host='westeurope.api.azureml.ms', port=443): Max retries exceeded with url: /discovery (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 110] Connection timed out'))` – tomconte Jan 06 '23 at 10:36
  • Yes, it will work behind a private endpoint. Is where you are accessing the data from also in the same VNet, for example, are you using a compute instance? – Sam Kemp May 03 '23 at 12:27
  • @SamKemp I get: Exception: [PyDatastoreSource::list_directory] fails with error: PermissionDenied(Some(AuthenticationError("UnboundLocalError: local variable 'authority' referenced before assignment"))). Do you know how I can solve this? Thank you. (my datastore URI is good but it seems I have some permissions issues) – Timbus Calin Aug 18 '23 at 18:41

I found another solution, easier than the one posted by @DeepDave.

Instead of loading the data from assets, I loaded it directly from the blob with the URL, using the modin library instead of Pandas. It worked like a charm.

Code below:

import modin.pandas as pd

# Blob SAS URL generated in the storage account (see the steps below)
url = 'URLLINKHERE'
df_bus = pd.read_csv(url, encoding='utf16')
df_bus.head()
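
A hedged aside: modin executes on top of a Ray or Dask backend, and a missing or mis-selected engine is a common reason a first attempt with modin fails. A minimal sketch (assuming Dask is installed; "dask" is just an example choice) that selects the engine explicitly:

# Select modin's execution engine before importing modin.pandas (assumes dask is installed).
import modin.config as modin_cfg
modin_cfg.Engine.put("dask")   # or "ray"

import modin.pandas as pd
df_bus = pd.read_csv('URLLINKHERE', encoding='utf16')  # same placeholder SAS URL as above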

To supplement, here is where to find the URL (a programmatic alternative is sketched after the steps):

  1. Go to the storage account and find the file.
  2. Right click on the file.
  3. Generate SAS.
  4. BLOB SAS URL -> that was the link I used.
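
As an alternative to the portal steps above, here is a sketch using the azure-storage-blob package (the account, container, blob, and key values are placeholders) that builds the read-only SAS URL from Python:

# Sketch: generate a read-only Blob SAS URL programmatically; all names are placeholders.
from datetime import datetime, timedelta
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas_token = generate_blob_sas(
    account_name="<storage_account>",
    container_name="<container>",
    blob_name="<folder>/<filename>.csv",
    account_key="<account_key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
url = f"https://<storage_account>.blob.core.windows.net/<container>/<folder>/<filename>.csv?{sas_token}"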

Hope this helps others.

Przemek