I ended up manually deleting some Delta Lake files (hosted on S3). Now my Spark job is failing because the Delta transaction logs point to files that no longer exist in the file system. I came across https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html but I am not sure how I should run this utility in my case.
1 Answer
You can do this by following the document you linked.
If you have a Hive table on top of your S3 data, run it as below:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN lists the file entries that need to be removed without changing anything. Run the command above first and verify that the listed files are the ones that actually need to be deleted.
Once you have verified them, run the same command without DRY RUN and it will do what you need:
%sql
FSCK REPAIR TABLE schema.testtable
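If you prefer a Python notebook cell instead of a %sql cell, you can issue the same statements through spark.sql. This is a minimal sketch, assuming a SparkSession named spark (as in a Databricks notebook) and reusing the example table name schema.testtable:

# Minimal sketch: run FSCK REPAIR from Python instead of a %sql cell.
# Assumes a SparkSession named `spark` and the example table schema.testtable.

# List the dangling file entries that would be removed from the Delta transaction log.
dry_run_result = spark.sql("FSCK REPAIR TABLE schema.testtable DRY RUN")
dry_run_result.show(truncate=False)

# After reviewing the list, remove those entries from the transaction log.
spark.sql("FSCK REPAIR TABLE schema.testtable")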
Now, if you have not created a Hive table and instead have a Delta table at a path, you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from Databricks and have mounted my S3 bucket path to Databricks. Make sure you have the backtick (`) after delta. and around the actual path, otherwise it won't work.
Here too, to perform the actual repair operation, remove DRY RUN from the command above and it will do what you need.
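The same path-based variant can also be run from Python. This is a sketch, where the path is just the illustrative mount point from the example above and should be replaced with your own:

# Sketch for a path-based Delta table; the path below is the illustrative
# mount point from the example above, not a real location.
table_path = "dbfs:/mnt/S3bucket/tables/testtable"

# Note the backticks around the path inside the SQL string.
spark.sql(f"FSCK REPAIR TABLE delta.`{table_path}` DRY RUN").show(truncate=False)

# Run without DRY RUN once the listed files look right.
spark.sql(f"FSCK REPAIR TABLE delta.`{table_path}`")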
