3

I am using an Azure ML notebook with a Python kernel to run the following code:

%reload_ext rpy2.ipython

from azureml.core import Dataset, Datastore,Workspace

subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'

workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'mynewdatastore')

# create tabular dataset from all parquet files in the directory
tabular_dataset_1 = Dataset.Tabular.from_parquet_files(path=(datastore,'/RNM/CRUD_INDIFF/CrudeIndiffOutput_PRD/RW_Purchases/2022-09-05/RW_Purchases_2022-09-05T17:23:01.01.parquet'))
df=tabular_dataset_1.to_pandas_dataframe()
print(df)

After executing this code, the notebook cell shows a Cancelled message, and the following message appears at the top of the cell:

The code being run in the notebook may have caused a crash or the compute may have run out of memory.
Jupyter kernel is now idle.
Kernel restarted on the server. Your state is lost.

2 cores, 14 GB RAM and 28 GB disk space are allocated to the compute instance. The Parquet file I am using in the code is 20.25 GiB, and I think this problem is caused by the large size of the file. Can anyone please help me resolve this error without breaking the file into multiple smaller files? Any help would be appreciated.

ankit

2 Answers

3

When a dataset is read with a pandas read_* function, default data types are assigned to each column: pandas infers the type from the observed values and loads the data into RAM accordingly. A value stored as int8 takes 8x less memory than one stored as int64, so you could change the dtypes to smaller ints, floats, etc. I suspect the error is caused by the 14 GB of RAM being exhausted.
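
As an illustration (the file and column names here are hypothetical, not from the question), downcasting numeric columns on a sample that fits in memory shows the kind of savings you can get:

import pandas as pd

# read a sample that fits in memory (hypothetical file name)
df = pd.read_parquet('sample.parquet')

# downcast integer and float columns to the smallest dtype that holds their values
for col in df.select_dtypes(include='integer').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')  # e.g. int64 -> int8/int16
for col in df.select_dtypes(include='float').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')    # e.g. float64 -> float32

# compare memory usage before and after the downcast
print(df.memory_usage(deep=True).sum())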

Like @ndclt says, you can load the data in chunks. Try that first, but if it does not work, I would move away from pandas entirely and use an alternative such as PySpark, Dask or Polars instead.

These libraries are much better suited to your situation, as they are far more efficient and faster when working with large amounts of data; see the sketch below.
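
As a minimal sketch (the column names are hypothetical, and the file is assumed to have been downloaded locally first, as in the other answer), Dask can read the Parquet file lazily and only materialise the reduced result:

import dask.dataframe as dd

# lazily reference the parquet file; nothing is loaded into memory yet
ddf = dd.read_parquet('RW_Purchases_2022-09-05T17:23:01.01.parquet')

# operations stay lazy; only the reduced result is brought into memory by .compute()
filtered = ddf[ddf['some_column'] > 0]                        # hypothetical column
result = filtered.groupby('another_column').size().compute()  # hypothetical column
print(result)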

It looks like the Azure ML Dataset class has a method to load the data into a Spark DataFrame, so this is a better fit for what you are doing. First make sure you have a Spark cluster set up, which you can do in Azure Synapse, then link it to the Azure ML workspace:

Create a Spark pool: https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal

Link Synapse to the ML workspace: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-link-synapse-ml-workspaces

Dataset to Spark DataFrame: https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py#azureml-core-dataset-to-spark-dataframe

There is a lot more detail about this in the notebook samples in Azure ML; there should be a folder called azure-synapse with good information and code samples.

Once you have set up the Spark cluster and linked it to the Azure ML workspace, you should just be able to do the following:

df = tabular_dataset_1.to_spark_dataframe()
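
From there, the heavy lifting runs distributed on the Spark pool rather than on the 2-core / 14 GB compute instance. As a minimal sketch (the column name is hypothetical):

# filtering and aggregation are now executed on the Spark pool
filtered = df.filter(df['some_column'].isNotNull())   # hypothetical column
print(filtered.count())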
Muhammad Pathan
  • Thank you so much for the answer. Since I don't have the rights to create a Spark cluster, I will inform the team and try it. Thank you for the help :) – ankit Oct 02 '22 at 14:29
2

The Parquet file which I am using in the code is of size 20.25 GiB and I think due to the large size of this file, this problem is being created

Yes, most likely. And since Parquet is compressed, the uncompressed data can be much bigger than the file size, and the library (from Azure or pandas) will add some overhead on top of that.

To avoid loading the whole file, there are two ideas:

  • load fewer rows,
  • load fewer columns (not all of them).

From what I read in the documentation of Dataset.Tabular.from_parquet_files, I cannot find any way to apply either of the two methods above. :/

But you can perhaps work around it by downloading the file onto the server (see this answer) and then reading it by chunks (see there) or loading only some of the columns.

from azureml.core import Dataset, Datastore, Workspace
import pyarrow.parquet as pq
import tempfile


subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'
dstore_path = '/RNM/CRUD_INDIFF/CrudeIndiffOutput_PRD/RW_Purchases/2022-09-05'
parquet_file_name = 'RW_Purchases_2022-09-05T17:23:01.01.parquet'

workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'mynewdatastore')

target = (datastore, dstore_path)
with tempfile.TemporaryDirectory() as tmpdir:
    # download the file locally as a FileDataset instead of a TabularDataset
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    # the parquet file is now in tmpdir: read it by chunks or select only
    # the columns you need (if you can)
    pq_file = pq.ParquetFile(f'{tmpdir}/{parquet_file_name}')
    for batch in pq_file.iter_batches():
        print("RecordBatch")
        batch_df = batch.to_pandas()
        # do something with the batch

The iter_batches documentation describes the columns argument, which lets you load only some of the columns.
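
For example (the column names are hypothetical), something like this reads only two columns, a batch at a time:

for batch in pq_file.iter_batches(batch_size=100_000, columns=['col_a', 'col_b']):
    batch_df = batch.to_pandas()  # only two columns, 100 000 rows at a time
    # do something with batch_df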

Working by batches assumes that you don't need the whole file loaded at once. If you do, you will have to use a bigger machine for your Jupyter notebook.

ndclt
  • Actually, I have to use the whole 20 GiB of data for processing, so if we break the file or take only some columns of the table for reading, then at some point I still have to combine the data and the notebook cell might crash again. Is there any way to load/read the whole 20 GiB of data in a notebook cell? I was using RStudio where it was working fine, but in the notebook it is creating a problem. – ankit Sep 29 '22 at 06:33
  • 1
    What operation do you need to do on the dataframe? Was RStudio running on the same machine as the one you describe? If yes, I guess that RStudio is working on your dataframe by chunks without telling you. – ndclt Oct 01 '22 at 17:40
  • RStudio is running on my local system (8 GB RAM) and takes around 2 days to process the millions of rows in this ~20 GiB data; the Azure ML notebook is hosted on the server, and that is where I am facing this issue. The operations on the dataframe are simple, like filtering data, but there is a loop in the code that takes a lot of time to process millions of rows when I run it on my local system, and on the server the notebook cell crashes while loading this ~20 GiB file. Anyway, thank you so much for taking the time to write this answer :) – ankit Oct 02 '22 at 14:24