
How to find and hard-delete objects older than n days in lakeFS? Later it'll be a scheduled job.

awrow

1 Answer


To do that, use the Garbage Collection (GC) feature in lakeFS.

Note: This feature cleans objects from the storage only after they are deleted from your branches in lakeFS.

You will need to:

  1. Define GC rules to set your desired retention period.

    From the lakeFS UI, go to the repository you would like to hard-delete objects from -> Settings -> Retention, and define a GC rule for each branch in the repository. For example:

    {
        "default_retention_days": 21,
        "branches": [
            {"branch_id": "main", "retention_days": 28},
            {"branch_id": "dev", "retention_days": 7}
        ]
    }
    
  2. Run the GC Spark job, which performs the actual cleanup:

    spark-submit --class io.treeverse.clients.GarbageCollector \
      -c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
      -c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
      -c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
      -c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
      -c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
      --packages io.lakefs:lakefs-spark-client-301_2.12:0.5.0 \
      example-repo us-east-1
    
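Since the question mentions turning this into a scheduled job, the rules from step 1 can be generated programmatically instead of pasted into the UI. A minimal sketch that only builds and serializes the same JSON shown above; actually uploading it (via the lakeFS UI or API) is left out:

```python
import json

# Build the GC rules from step 1 as a plain Python dict.
# The key names mirror the JSON accepted by lakeFS' retention settings.
gc_rules = {
    "default_retention_days": 21,
    "branches": [
        {"branch_id": "main", "retention_days": 28},
        {"branch_id": "dev", "retention_days": 7},
    ],
}

# Serialize for upload, e.g. into the Settings -> Retention screen.
rules_json = json.dumps(gc_rules, indent=4)
print(rules_json)
```

This keeps the retention policy in version control alongside the job that schedules the Spark cleanup run.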
johnnyaug
Tal Sofer
  • Thanks, but how do I actually find and delete these objects? – awrow Nov 28 '21 at 11:01
    You can use the [Python API client](https://docs.lakefs.io/integrations/python.html) lakeFS provides to list objects, filter them by mtime, and delete them. If this does not support the scale of your environment, you can use the [Metadata client](https://github.com/treeverse/lakeFS/tree/master/clients/spark) to achieve the same goal from Spark. – Tal Sofer Nov 28 '21 at 14:49
  • Sounds good ty! – awrow Nov 28 '21 at 14:51
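A sketch of the list-filter-delete step suggested in the comments. The helper `older_than` is self-contained; `hard_delete_older_than` assumes the generated `lakefs_client` package exposes `objects.list_objects` / `objects.delete_object` and a per-object `mtime` in epoch seconds — verify the exact names against your client version. The GC Spark job above is still required afterwards to remove the data from the underlying storage.

```python
import time


def older_than(objects, days, now=None):
    """Return paths of objects whose mtime (epoch seconds) is more than
    `days` days in the past. `objects` is an iterable of (path, mtime)."""
    now = time.time() if now is None else now
    cutoff = now - days * 24 * 60 * 60
    return [path for path, mtime in objects if mtime < cutoff]


def hard_delete_older_than(client, repository, branch, days):
    """Assumed sketch against the lakefs_client generated API:
    list the objects on a branch, then delete the stale ones.
    Pagination of list_objects is elided for brevity."""
    listing = client.objects.list_objects(repository, branch)
    stale = older_than(((o.path, o.mtime) for o in listing.results), days)
    for path in stale:
        client.objects.delete_object(repository, branch, path)
    return stale
```

For larger repositories, the comments point at the Spark metadata client as the scalable alternative to listing objects one page at a time.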