4

AWS by the way, if that matters. We have an old production table that has been running in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Since then, it has written many small files (like 10,000 an hour that are ~100kb each).

Also the _delta_log folder has grown to ~170TB, mostly of files with the name ***.checkpoint.***.parquet, that it has been creating every time we write to the table (hourly batch job, checkpoints are ~50GB an hour). So basically I want to get rid of that 170TB, and ideally also clean up/optimize the many small files. (also we don't really care about time travel for this table)

I've tried setting delta.logRetentionDuration to 30 days and running VACUUM, which cleaned up a little, but there are still a bunch of checkpoints in the _delta_log folder that are up to 6 months old. But I'd still like to get rid of (almost) all of them. I've tried running OPTIMIZE x where utc_date = '2020-01-01 (data is partitioned on utc_date) on a single day of data, but it took half an hour, which extrapolating to running it on the entire table would take weeks... I've also tried copying a single day of data to a new table which was much faster (1-2 min per day), which lead me to think a good option may be to just copy all the data to a new table.

So to summarize, what do you think is the best/most efficient way to to fix this table, and to deal with such a blown up delta log?

Fenno Vermeij
  • 126
  • 3
  • 4
  • 11
  • Do you have extra partitioning levels beside `utc_date` ? – Alex Ott Aug 12 '22 at 18:09
  • There is another partitioning level, under utc_date, that has between 5 and 10 partitions per day – Fenno Vermeij Aug 15 '22 at 07:27
  • I'm not sure why you've got so many `checkpoint`s since they're to build a view of the state of the delta table and hence IMHO only the ones that are after the versions you're left with (after `VACUUM`) are really needed. Can you include the file listing of the directory? Also, log and data directories are different in purpose and are (should be?) cleaned up differently. – Jacek Laskowski Aug 15 '22 at 18:25

0 Answers0