
I have a scheduled task that pulls the past 60 days of data from a database and writes it out as a parquet file. At the moment, every run of the script overwrites the previous parquet file, which takes both time and resources. Is there a way to append new data to the end of the existing parquet file and, at the same time, remove data that is older than 60 days from it?

The date of each row is stored in a column of the data itself, so working out the date of a row and checking whether it is older than 60 days is something I think I can do with pandas. Otherwise, is there a quick way to go about this? Below is my current code, which just creates the parquet file from the df of pulled data.

df.to_parquet(r'{}\{}\{}_data.parquet'.format(basepath, product, product)) 

1 Answer


Parquet files can't be modified in place, so if you only have one file you have to recreate it from scratch every time. See this question

If you want to speed up the process, you can load the existing file into memory, drop the old rows and append the new ones using pandas, as sketched below. But this is bug-prone and makes the job rely on the previous day's results.
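A minimal sketch of that in-memory approach, reusing basepath, product and df from the question; the date column name (here date) and the datetime comparison are assumptions you would adjust to your schema:

    import pandas as pd

    # Same path as in the question.
    path = r'{}\{}\{}_data.parquet'.format(basepath, product, product)

    # Load the previous run's file and keep only rows from the last 60 days.
    existing = pd.read_parquet(path)
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=60)
    recent = existing[existing['date'] >= cutoff]

    # Append the freshly pulled rows (df) and rewrite the whole file;
    # parquet can't be updated in place, so the file is replaced as a whole.
    combined = pd.concat([recent, df], ignore_index=True)
    combined.to_parquet(path)

Note that if the new pull overlaps the previous window you would also need some deduplication step, which is part of why this approach is fragile.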

An alternative would be to save one file per day, and have a job that concatenates the last 60 days into one parquet file every day.
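A rough sketch of that one-file-per-day layout, again reusing basepath, product and df from the question; the daily subfolder, the file naming and the date column are illustrative choices, not anything prescribed:

    import pandas as pd
    from pathlib import Path
    from datetime import date, timedelta

    daily_dir = Path(basepath) / product / 'daily'
    daily_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: the scheduled task writes only today's slice
    # (the database query would be changed to pull only the latest day).
    today = date.today()
    df.to_parquet(daily_dir / f'{today:%Y-%m-%d}.parquet')

    # Step 2: rebuild the rolling 60-day file from the per-day files.
    wanted = [daily_dir / f'{today - timedelta(days=i):%Y-%m-%d}.parquet'
              for i in range(60)]
    frames = [pd.read_parquet(p) for p in wanted if p.exists()]
    pd.concat(frames, ignore_index=True).to_parquet(
        Path(basepath) / product / f'{product}_data.parquet'
    )

The same job can also delete per-day files once they fall outside the 60-day window, so the directory doesn't grow without bound.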
