I have a scheduled task that pulls the past 60 days of data from a database and writes it to a Parquet file. Every time the task runs, the previous Parquet file is overwritten, which takes both time and resources. Is there a way to append new data to the end of the existing Parquet file and, at the same time, remove rows that are older than 60 days?
The date of each row is stored in a column of the data itself, so checking whether a row is older than 60 days is something I think I can do with pandas, but is there a quicker way to go about it? Below is my current code, which just creates the Parquet file from the DataFrame of pulled data.
# current approach: rebuild the whole file from the 60-day pull and overwrite it
df.to_parquet(r'{}\{}\{}_data.parquet'.format(basepath, product, product))
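For context, the closest I can get with pandas alone is something like the sketch below. The column name 'date' and the read-back step are my own assumptions rather than part of the current script, and it still rewrites the whole file, which is what I'm hoping to avoid:

import pandas as pd

# basepath, product and df come from the existing script;
# 'date' is a placeholder for whatever column holds the row's timestamp.
path = r'{}\{}\{}_data.parquet'.format(basepath, product, product)

existing = pd.read_parquet(path)                      # last run's file
cutoff = pd.Timestamp.now() - pd.Timedelta(days=60)   # 60-day retention window

combined = pd.concat([existing, df], ignore_index=True)
combined = combined[pd.to_datetime(combined['date']) >= cutoff]  # drop stale rows
combined = combined.drop_duplicates()                 # the pull overlaps the last run
combined.to_parquet(path)                             # still rewrites the whole file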