
I have a web scraper application that is scheduled to run 5 times a day. The process runs in Python on my personal laptop.

The output is a DataFrame with no more than 20k rows. This DataFrame is appended to a compressed file holding all the historical data: a .csv.gzip of ~110 MB, growing ~0.5 MB per day, which is the input to a Power BI dashboard.

The problem is that every time the script runs, it has to

unzip → load the whole file into memory → append the new rows → save (overwrite)

and it’s not very efficient.
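
In code, the current flow is roughly the sketch below (a minimal illustration assuming pandas; the file name is a placeholder and df is the freshly scraped DataFrame):

import pandas as pd

hist_path = "history.csv.gz"  # placeholder name for the ~110 MB compressed history

history = pd.read_csv(hist_path, compression='gzip')        # unzip + load the whole file into memory
history = pd.concat([history, df], ignore_index=True)       # append the new rows (<= 20k)
history.to_csv(hist_path, index=False, compression='gzip')  # save (overwrite) the entire file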

It seems the best solution would be a format that allows appending the latest data without reading the whole file.

Now we are migrating the application to Azure and have to adapt our architecture. We are using Azure Functions to run the web scraper, and Azure Blob Storage for storage.

Is there a more viable architecture for this job (appending each new extraction to a historical file) than using gzip?

I am assuming that a SQL Database would be more expensive, so I am giving Blob Storage one last chance to make this work at low cost.

Update

The code below works locally. It appends the new data to the historical gzip without loading it (gzip files may contain concatenated compressed streams, so appending in 'a' mode still produces a valid file).

df.to_csv(gzip_filename, encoding='utf-8', compression='gzip', mode='a')

The code below does not work on Azure: it overwrites the historical data with the new data.

container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)

# to_csv without a path returns a string, so mode='a' has no effect here
output = df.to_csv(index=False, encoding='utf-8', compression='gzip', mode='a')

# overwrite=True replaces the existing blob, so only the newest extraction is kept
container_client.upload_blob(gzip_filename, output, overwrite=True, encoding='utf-8')
  • Please provide the code written in Azure Function! –  Mar 23 '22 at 13:16
  • In Azure Blob Storage also, you can overwrite the files to update/append the latest content to the existing files! –  Mar 23 '22 at 13:18
  • I believe you can use Azure Append Blobs in Blob Storage which is optimized for Append Operations. https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs – Dhanuka Jayasinghe Mar 23 '22 at 17:15
  • Also, you can refer to this [workaround](https://stackoverflow.com/a/71584080/16630138) to know how the blob storage trigger in Azure Function will update the existing files and the references show the code related to appending the new content to existing content if those files exist! –  Mar 23 '22 at 17:25
  • Hey, everyone... I've just discovered that gzip has mode='a'. It works locally, but not in Azure Blob. I think that link from @DhanukaJayasinghe is helpful for understanding append blobs. I'll provide some code in the question – Júlio Manfio Mar 23 '22 at 18:21

2 Answers


I noticed you have used overwrite=True, which, per the documentation, means:

If overwrite=True, the old append blob is removed and a new one is created. The default value is False.

If overwrite=False and the data already exists, no error is raised and the data is appended to the existing blob.

So, try changing overwrite to False, as follows:

container_client.upload_blob(gzip_filename, output, overwrite=False, encoding='utf-8')  

Refer here for more information


I have found a way following @Dhanuka Jayasinghe's tip: using Append Blobs.

The code below works for me. It appends the new rows without having to read the whole file.

from azure.storage.blob import ContainerClient, BlobType

# establish a connection to the container in Blob Storage
container_client = ContainerClient.from_connection_string(conn_str='your_conn_str', container_name='your_container_name')

# header=None if your file already has headers
output = df.to_csv(index=False, encoding='utf-8', header=None)

# save to the blob, specifying blob_type as AppendBlob
container_client.upload_blob("output_filename.csv", output, encoding='utf-8', blob_type=BlobType.AppendBlob)

References:

Microsoft documentation about the 3 blob types (block, page, and append)

Stack Overflow question with an explanation of append blobs