I have a number of incoming data feeds that I receive daily. Each file is between 10-150MB. For each file I receive, I append it to the relevant parquet dataset for that feed. In practice this means reading the day's new file into a pandas dataframe, reading the existing parquet dataset into a dataframe, appending the new data to the existing data, and rewriting the parquet dataset. From what I've seen there is no way to append to a parquet dataset, or I would have done that. Given that the daily files are relatively small, I think it would be wasteful to simply write each one as its own partition of the dataset - I would be adding 2-20MB parquet files daily, and my understanding is that this is too small for a parquet file and that having this many partitions would create significant overhead.
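For reference, this is roughly what the current process looks like (a minimal sketch; the file paths, the CSV format of the incoming feed, and the dataset name are placeholders):

```python
import pandas as pd

# Today's incoming file (hypothetical path and format)
new_df = pd.read_csv("incoming/feed_2024-01-15.csv")

# Existing history for this feed -- this is the multi-GB read
existing_df = pd.read_parquet("datasets/feed.parquet")

# Append and rewrite the whole dataset
combined = pd.concat([existing_df, new_df], ignore_index=True)
combined.to_parquet("datasets/feed.parquet", index=False)
```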
I've been running with this setup for a while now, and reading the existing parquet files into memory is becoming quite expensive; I end up with multi-GB dataframes.
My plan from here is to define a partitioning scheme on the existing unpartitioned datasets (such as year, or year/quarter), and then, when running the daily process, read in only the partition relevant to the new data, append the new data to it, and rewrite that partition alone. A sketch of what I have in mind is below.
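Roughly, for the daily update step, assuming the existing dataset has already been rewritten under a hive-style year=... layout, a reasonably recent pyarrow, and a "date" column to derive the partition from (paths and column names are placeholders):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# Today's file; assume it has a "date" column to derive the partition key from
new_df = pd.read_csv("incoming/feed_2024-01-15.csv")
new_df["year"] = pd.to_datetime(new_df["date"]).dt.year

# Read back only the partition(s) the new data touches, not the whole dataset
years = new_df["year"].unique().tolist()
existing = pd.read_parquet("datasets/feed", filters=[("year", "in", years)])

combined = pd.concat([existing, new_df], ignore_index=True)
# Partition columns can come back as a different dtype (e.g. categorical)
# depending on how they were read, so normalize before writing
combined["year"] = combined["year"].astype("int64")

# Rewrite only the matching year directories; delete_matching removes the old
# files for those partitions and leaves every other year untouched
ds.write_dataset(
    pa.Table.from_pandas(combined, preserve_index=False),
    "datasets/feed",
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)
```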
I'm fairly certain this will work and solve my issues, but I can see it being a fair amount of work to make sure it behaves correctly and scales across all of my uses/datasets. Before moving forward with it, I wanted to check whether there is some cleaner/simpler way to incrementally add to parquet datasets with pandas/pyarrow/dask.