AWS, by the way, if that matters. We have an old production table that has been receiving hourly batch writes in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Over that time it has accumulated a huge number of small files (roughly 10,000 per hour at ~100 KB each).
The `_delta_log` folder has also grown to ~170 TB, mostly files named `***.checkpoint.***.parquet`, which it creates every time we write to the table (hourly batch job, each checkpoint is ~50 GB). So basically I want to get rid of that 170 TB, and ideally also clean up/optimize the many small files. (We also don't really care about time travel for this table.)
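For context, this is roughly how I've been sizing up the problem (the table name `events` is a placeholder for ours):

```sql
-- Quick sanity checks (assumes the table is registered in the metastore as `events`).
DESCRIBE DETAIL events;        -- numFiles / sizeInBytes show the small-file problem
SHOW TBLPROPERTIES events;     -- shows which retention / auto-optimize properties are actually set
```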
I've tried setting `delta.logRetentionDuration` to 30 days and running `VACUUM`, which cleaned up a little, but there are still a bunch of checkpoints in the `_delta_log` folder that are up to 6 months old, and I'd like to get rid of (almost) all of them.
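For reference, this is roughly what I ran (again with a placeholder table name). My understanding is that `VACUUM` only removes unreferenced data files, while old checkpoints in `_delta_log` are only cleaned up lazily when new checkpoints get written, based on the retention settings:

```sql
-- Set log retention on the table.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days'
);

-- VACUUM deletes data files that are no longer referenced by the log and are
-- older than the retention threshold; it does not touch _delta_log itself.
VACUUM events RETAIN 168 HOURS;   -- 7 days
```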
I've tried running `OPTIMIZE x WHERE utc_date = '2020-01-01'` (the data is partitioned on utc_date) on a single day of data, but that took half an hour; extrapolated to the entire table, it would take weeks...
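In case it helps, this is the shape of what I ran (placeholder table name). A wider partition predicate is also possible, but at ~30 minutes per day it doesn't change the overall math:

```sql
-- One day at a time (this is what took ~30 minutes):
OPTIMIZE events WHERE utc_date = '2020-01-01';

-- A range predicate on the partition column works too, but it just batches the same work:
OPTIMIZE events WHERE utc_date >= '2020-01-01' AND utc_date < '2020-02-01';
```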
I've also tried copying a single day of data to a new table, which was much faster (1-2 min per day), and that led me to think a good option may be to just copy all the data to a new table.
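Something like this is what I'm picturing, one partition at a time (names are placeholders, and the per-day inserts would obviously be scripted):

```sql
-- Create the new compacted table from the first day...
CREATE TABLE events_compacted
USING DELTA
PARTITIONED BY (utc_date)
AS SELECT * FROM events WHERE utc_date = '2020-01-01';

-- ...then append each remaining day the same way.
INSERT INTO events_compacted
SELECT * FROM events WHERE utc_date = '2020-01-02';
```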
So to summarize: what do you think is the best/most efficient way to fix this table and to deal with such a blown-up delta log?