
I have a number of incoming data feeds that I receive daily. Each file is between 10-150MB. For each file I receive, I append it to the relevant parquet dataset for that feed. In practice this means reading the day's new file into a pandas dataframe, reading the existing parquet dataset into a dataframe, appending the new data to the existing data, and rewriting the parquet dataset. From what I've seen there is no way to append to a parquet dataset, or I would have done that. Given that the daily files are relatively small, I think it would be wasteful to simply write each one as its own partition of the dataset: I would be adding 2-20MB parquet files daily, and my understanding is that this is too small for a parquet file and would result in significant overhead from having that many partitions.
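Roughly, the current daily process looks like the sketch below (paths and the feed name are placeholders):

```python
import pandas as pd

# Placeholder paths/feed name; today's file is one of the 10-150MB deliveries.
new_df = pd.read_csv("incoming/feed_a_2020-04-09.csv")

# Read the entire existing dataset back into memory.
existing_df = pd.read_parquet("datasets/feed_a.parquet")

# Append in memory and rewrite the whole dataset.
combined = pd.concat([existing_df, new_df], ignore_index=True)
combined.to_parquet("datasets/feed_a.parquet", index=False)
```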

I've been running with my existing setup for a while now, and reading the existing parquet files into memory is becoming quite expensive: I end up with multi-GB dataframes.

My plan from here was to define a partition scheme on the existing unpartitioned datasets (such as year or year/quarter), and then, when running the daily process, read in only the partition relevant to the new data, append the new data to it, and rewrite that partition alone.
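Concretely, I'm picturing something like the sketch below, assuming the one-time repartition to a hive-style `year=YYYY` layout has already happened, each partition is a single file written by this process, and a daily file only ever touches one year (paths and column names are placeholders):

```python
import pathlib
import pandas as pd

# Placeholder paths and a placeholder "date" column; assumes a hive-style
# year=YYYY layout with one file per partition.
new_df = pd.read_csv("incoming/feed_a_2020-04-09.csv", parse_dates=["date"])
year = int(new_df["date"].dt.year.iloc[0])

# Read only the partition the new data belongs to rather than the whole
# multi-GB dataset (needs a reasonably recent pyarrow for filter support).
partition_df = pd.read_parquet(
    "datasets/feed_a", engine="pyarrow", filters=[("year", "=", year)]
)

# Append, dropping the hive partition column that pyarrow adds on read.
combined = pd.concat([partition_df.drop(columns="year"), new_df],
                     ignore_index=True)

# Rewrite only this year's partition; the other partitions stay untouched.
out_dir = pathlib.Path(f"datasets/feed_a/year={year}")
out_dir.mkdir(parents=True, exist_ok=True)
combined.to_parquet(out_dir / "part-0.parquet", index=False)
```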

I'm fairly certain this will work and solve my issues, but I can see it being a fair amount of work to ensure it works correctly and scales across all my uses/datasets. Before moving forward with this, I wanted to see if there is some cleaner/simpler way to incrementally add to parquet datasets with pandas/pyarrow/dask.

matthewmturner
  • Question: is it possible that today's feed also contains data for a few days ago? I.e., are you going to have to deal with updates and/or duplicates? If not, I suggest keeping a separate file for every date. – rpanai Apr 09 '20 at 19:44
  • @rpanai It's maybe possible but highly unlikely, and it would require some creative work from the upstream systems that feed me the data. I could write a different file for each date, but then I would very quickly have hundreds of files that are only 2-20MB each, which is quite small for a parquet file, and there would be a lot of overhead in reading that many files. (ref: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files) – matthewmturner Apr 09 '20 at 21:21
  • I've been in a similar situation and, while it's true that 20MB is not the ideal size, trust me, it's less problematic to have one file per day. Then it depends on how you need to process this data; I'm guessing it's needed for dask and/or Athena. – rpanai Apr 09 '20 at 21:39
  • Or you could have two folders: in the first you keep one file per day, and in the second you keep a year/month.parquet layout. Then you just need to add the daily file to the current month, and every month will weigh about 600MB. – rpanai Apr 09 '20 at 22:32
  • Thanks for the thoughts. What would be the value add in keeping the dailies while updating monthly? Are you saying to make it two processes: one to add the daily file and one to read the dailies and then update the partitioned dataset? – matthewmturner Apr 09 '20 at 23:28
  • It could be the same process. Once you receive the daily data, you save it to the daily folder and update the file in year=xxxx/month=x/file.parquet with the new data. That way you'll have the original data untouched and a processed/cleaned copy to exploit. – rpanai Apr 09 '20 at 23:55
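
A minimal sketch of the two-folder approach described in the comments above (raw dailies kept untouched, plus a per-month file that gets rewritten each day); all paths and the `date` column are placeholders:

```python
import pathlib
import pandas as pd

# Placeholder paths and a placeholder "date" column.
daily_file = pathlib.Path("incoming/feed_a_2020-04-09.csv")
new_df = pd.read_csv(daily_file, parse_dates=["date"])

# 1) Archive the raw daily delivery as its own small parquet file.
raw_dir = pathlib.Path("raw/feed_a")
raw_dir.mkdir(parents=True, exist_ok=True)
new_df.to_parquet(raw_dir / f"{daily_file.stem}.parquet", index=False)

# 2) Fold the new rows into the current month's file (~600MB by month end).
ts = new_df["date"].iloc[0]
month_path = pathlib.Path(
    f"clean/feed_a/year={ts.year}/month={ts.month}/data.parquet"
)
month_path.parent.mkdir(parents=True, exist_ok=True)

if month_path.exists():
    month_df = pd.concat([pd.read_parquet(month_path), new_df],
                         ignore_index=True)
else:
    month_df = new_df

month_df.to_parquet(month_path, index=False)
```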
