
I work for a retail organization. We are developing a control tool that retrieves transactional data and displays the 3 sites performing most similarly to a chosen test site, based on a selected parameter (for example, sales). This currently works fine: the code is written in R Shiny, connected to the on-premises database, and executed from personal laptops. The requirement is to host this tool on Azure so others can access it easily.

We will be rewriting this tool in Python and are planning to retrieve the CSV data (which we have already ingested) from an Azure Data Lake Storage Gen1 account.

A pandas DataFrame can only be created from local files, not directly from storage. This means we have to download the CSV from the data lake, convert it to a pandas DataFrame, and then run our Python algorithms. Downloading the files to local storage would occupy disk space on the App Service, which is limited to 250 GB. The files are fairly large (>5 GB), there can be multiple of them, and we will also have multiple users accessing this tool. I would think the disk storage will fill up fairly quickly.

Is there any way to automatically clear the temp storage at regular intervals? Or should this be managed in the code itself after each execution?
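To illustrate what I mean by handling it in the code, a per-run cleanup could look roughly like this (a sketch; download_from_datalake is a hypothetical helper standing in for whatever ADLS Gen1 download call we end up using):

import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "transactions.csv")
    # Hypothetical helper standing in for the actual ADLS Gen1 download call
    download_from_datalake("adl://ourlake/transactions.csv", local_path)
    dataframe = pd.read_csv(local_path)
    # ... run the site-comparison algorithm here ...
# The temporary directory and the downloaded CSV are deleted automatically here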

Anupam Chand
  • Azure is not short of resources to do what you are describing here. Broadly, the following options are available: 1. You do not have to download the files to a local disk: using serverless Azure Functions, you can read, transform, and prune the files and write them back to any Azure storage you need. 2. Azure Databricks can also do what you are describing. 3. You can also create external tables in Synapse, spin up resources, and do anything you want with pandas. – wwnde Feb 10 '22 at 06:28

1 Answer


First of all, there is no such thing as a single operation that "converts a large CSV blob to a pandas DataFrame".

The Python pandas module allows us to read any CSV file, transform the data, and save it at any desired location.

What you are asking can be achieved using an Azure Function.

Refer to this SO thread to create an Azure Function and read a blob file with it.

Use the code below to read that file with pandas:

# Import pandas
import pandas as pd

# Read the CSV into a DataFrame; pd.read_csv accepts a file-like object
# (e.g. the blob stream obtained in the linked thread), not just a local path
dataframe = pd.read_csv(<variable_in_which_you_read_the_CSV_file>)
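Putting the pieces together, a blob-triggered Azure Function that reads the CSV into pandas entirely in memory could look roughly like this (a minimal sketch, assuming the Python v1 programming model with a function.json binding; the binding and container details are illustrative, not taken from the linked thread):

import logging

import azure.functions as func
import pandas as pd

def main(myblob: func.InputStream):
    # func.InputStream is file-like, so pandas can read it directly in memory
    # without writing anything to the App Service's local disk
    dataframe = pd.read_csv(myblob)
    logging.info("Loaded %d rows from %s", len(dataframe), myblob.name)
    # ... run the site-comparison logic on `dataframe` here ...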

Do the required transformations on this DataFrame and upload the result to Azure Storage.

Refer to the Upload blobs to a container code sample for the uploading part.
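For reference, the upload step with the azure-storage-blob SDK could look something like this (a sketch; the connection-string setting, container, and blob names are placeholders):

import os

from azure.storage.blob import BlobServiceClient

# Connection string read from an app setting; the setting name is a placeholder
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
blob_client = service.get_blob_client(container="output", blob="transformed.csv")

# Serialize the transformed DataFrame to CSV in memory and upload it,
# again without touching local disk
blob_client.upload_blob(dataframe.to_csv(index=False), overwrite=True)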

Utkarsh Pal