I have a Python script where I'm downloading a file into memory and storing it in a variable:
from io import BytesIO

from azure.storage.blob import ContainerClient


def downloadBlob(blob_name_file: str):
    # container client to access the container
    container_str_url = secret_client.get_secret("testString").value
    container_client = ContainerClient.from_container_url(container_str_url)
    # blob client to access the specific blob
    blob_client = container_client.get_blob_client(blob=blob_name_file)
    # download the blob into an in-memory bytes buffer
    stream_downloader = blob_client.download_blob()
    stream = BytesIO()
    stream_downloader.readinto(stream)
    return stream


stream = downloadBlob(blob_name_file)
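To get a sense of how much memory the raw download itself takes up, a quick check could be dropped in right after the call (just a sketch; getbuffer().nbytes reports how many bytes the BytesIO object currently holds):

# getbuffer() returns a zero-copy memoryview over the BytesIO contents;
# nbytes is the number of bytes the buffer currently holds in memory.
print(f"parquet blob held in memory: {stream.getbuffer().nbytes / 1024**2:.1f} MiB")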
After I have that, I load it into a DataFrame:
try:
    processed_df = pd.read_parquet(stream, engine='pyarrow')
except Exception as e:
    print(e)
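To see how much the DataFrame itself occupies once it's decoded out of the parquet bytes, something like this could be used (a sketch; memory_usage(deep=True) includes the real size of object/string columns):

# deep=True accounts for the Python objects behind object/string columns,
# which is usually where the surprising memory cost lives.
df_bytes = processed_df.memory_usage(deep=True).sum()
print(f"DataFrame size in memory: {df_bytes / 1024**2:.1f} MiB")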
Once the data is in the DataFrame, I delete the stream variable to free up memory (or at least that's the goal):
del stream
The problem I'm facing is that I'm still using too much memory. I'm running the script on Azure and am limited to roughly 2.5 GB of RAM. It runs fine on one machine without memory issues, but when I scale it up to even just two machines with one instance each, I sometimes hit the memory cap. The part of the code I've outlined in this post is what I assume uses the most memory; the rest of the script basically just passes the DataFrame around. I even reduce the DataFrame's size and del the original.
My question is: is there a better way of doing what I'm doing at all? And is del doing what I believe it is doing? Because I'll be honest, even after adding del to my script, it didn't seem to affect memory usage much, if at all.
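To make it concrete, this is roughly what I think del is (and isn't) doing here (a simplified sketch; gc is the standard-library garbage collector, and the gc.collect() call is only there to force an immediate collection pass):

import gc

# del removes the name 'stream' from the namespace; the BytesIO object itself
# is only freed once its reference count drops to zero, i.e. nothing else
# (another variable, a reference kept by pyarrow, etc.) still points at it.
del stream

# collect() forces a garbage-collection pass, which only matters for objects
# caught in reference cycles; even then the process may not return the freed
# memory to the OS, since the allocator can keep the pages for reuse.
gc.collect()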