I have a web scraper application that is scheduled to run 5 times a day. The process runs in Python on my personal laptop.
The output is a DataFrame with no more than 20k rows. This DataFrame is appended to a compressed file holding all the historical data: a .csv.gzip of ~110 MB that grows by ~0.5 MB per day and is the input to a dashboard in Power BI.
The problem is that every time the script runs, it has to
unzip → load the whole file into memory → append the new rows → save (overwrite)
which is not very efficient.
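For context, the update step currently looks roughly like this (a minimal sketch; hist_path stands for the historical .csv.gzip and df for the freshly scraped frame):

import pandas as pd

# decompress and load the entire ~110 MB history just to add the new rows
hist = pd.read_csv(hist_path, compression='gzip')
hist = pd.concat([hist, df], ignore_index=True)
# rewrite (overwrite) the whole compressed file
hist.to_csv(hist_path, index=False, compression='gzip')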
It seems the best approach would be a format that allows appending the latest data without reading the whole file.
We are now migrating the application to Azure and have to adapt our architecture. We are using Azure Functions to run the web scraper and Azure Blob Storage for storage.
Is there a more viable architecture for this job (appending each new extraction to a historical file) than using gzip?
I am assuming that a SQL database would be more expensive, so I am giving Blob Storage one last chance to make this work at low cost.
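One option I am looking at is an Append Blob, which can grow in place so each run only uploads the new rows. A minimal sketch (untested; it reuses conn_str, container_name and df from the code below, uses a placeholder blob name, and writes plain CSV instead of gzip):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str=conn_str, container_name=container_name, blob_name='history.csv')
is_new = not blob.exists()
if is_new:
    blob.create_append_blob()
# write the column header only once, when the blob is first created
chunk = df.to_csv(index=False, header=is_new, encoding='utf-8')
# each appended block is limited to ~4 MiB, so very large runs would need chunking
blob.append_block(chunk.encode('utf-8'))

Since concatenated gzip members still form a valid gzip stream, each run's chunk could also be gzip-compressed before append_block if keeping the blob compressed matters, though I have not verified how Power BI handles that.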
Update
The code below works locally. It appends the new data to the historical gzip without loading it.
# with a file path and mode='a', pandas appends a new gzip member to the existing file
df.to_csv(gzip_filename, encoding='utf-8', compression='gzip', mode='a')
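This works because a gzip file may contain several concatenated members: with a real file path and mode='a', pandas opens the file in append mode and writes the new rows as an extra member. Reading it back still yields a single frame (though without header=False each append also writes the column header again as a data row):

import pandas as pd

# concatenated gzip members decompress as one continuous stream
hist = pd.read_csv(gzip_filename, compression='gzip')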
The code below does not work on Azure: it overwrites the historical data with the new data instead of appending.
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
# with no path, to_csv just returns the CSV as a string, so mode='a' has no effect here
output = df.to_csv(index=False, encoding='utf-8', compression='gzip', mode='a')
# upload_blob with overwrite=True replaces the entire blob, so only the latest run survives
container_client.upload_blob(gzip_filename, output, overwrite=True, encoding='utf-8')
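Another low-cost pattern that avoids appending altogether would be to write one small blob per run (roughly 0.1 MB compressed) and let Power BI combine all the blobs in the folder. A rough sketch with a made-up naming scheme:

import gzip
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(
    conn_str=conn_str, container_name=container_name)
# one immutable blob per extraction; nothing is ever re-read or rewritten
blob_name = f"history/{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.csv.gz"
payload = df.to_csv(index=False, encoding='utf-8').encode('utf-8')
container_client.upload_blob(blob_name, gzip.compress(payload))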