
I need to clean a MongoDB collection of 200 TB and delete documents older than a certain timestamp. I am trying to build a new collection from the existing one and then run a delete query, since running a delete on the collection that is currently in use would slow down the other requests to it. I have thought of cloning the collection either by taking a dump of it, or by writing a read-and-write script that reads from the present collection and writes to the cloned collection. My question is: is a batched read/write (e.g. reading and writing 1000 documents at a time) faster than a dump?

EDIT: I found this, this and this article, and want to know whether writing a script in the way described above is the same as creating an ssh pipe of read and write. E.g., is a Node/Python script that fetches 1000 documents from a collection and inserts them into a clone collection the same as ssh *** ". /etc/profile; mongodump -h sourceHost -d yourDatabase … | mongorestore -h targetHost -d yourDatabase"?
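For reference, a rough sketch of the kind of batched read/write script I have in mind (using pymongo; the connection string, database/collection names and cutoff date are only placeholders):

```python
# Rough sketch of the batched copy script described above (pymongo).
# Connection string, database/collection names and cutoff date are placeholders.
from datetime import datetime, timezone

from pymongo import MongoClient

BATCH_SIZE = 1000
CUTOFF = datetime(2022, 1, 1, tzinfo=timezone.utc)

client = MongoClient("mongodb://localhost:27017")
source = client["yourDatabase"]["collection"]
target = client["yourDatabase"]["collection_clone"]

batch = []
# Only read documents newer than the cutoff, so the outdated data is never copied.
for doc in source.find({"timestamp": {"$gt": CUTOFF}}, batch_size=BATCH_SIZE):
    batch.append(doc)
    if len(batch) >= BATCH_SIZE:
        target.insert_many(batch, ordered=False)  # unordered inserts are faster
        batch = []
if batch:
    target.insert_many(batch, ordered=False)
```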

isnvi23h4
  • How much data do you need to delete? How do you create the clone? (Do you have sufficient disc space for 2 times 200 TB?) – Wernfried Domscheit Mar 21 '22 at 10:33
  • @WernfriedDomscheit I am not sure exactly how much data will be deleted; all the data before a particular timestamp would be deleted, so we can assume 50-80 TB. Yes, we have sufficient disc space for the operation. – isnvi23h4 Mar 21 '22 at 10:36
  • @WernfriedDomscheit added an update at the end of the question. – isnvi23h4 Mar 21 '22 at 10:43

1 Answer


I would suggest this approach:

  • Rename the collection. Your application will immediately create a new, empty collection with the old name as soon as it tries to insert data. You may want to create some indexes on it (see the sketch after this list).
  • Run mongoexport/mongoimport to import only the valid data, i.e. skip the outdated documents.
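A minimal sketch of the rename step, here with Python/pymongo (the names and the index are placeholders; you can of course do the same in the mongo shell):

```python
# Sketch of the rename step (pymongo); database/collection/index names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["yourDatabase"]

# Rename the live collection; the application recreates "collection"
# with the old name on its next insert.
db["collection"].rename("collection_old")

# Optionally create the indexes you need on the new, empty collection.
db["collection"].create_index([("timestamp", 1)])
```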

Yes, in general mongodump/mongorestore might be faster, however with mongoexport you can define a query and limit the data which is exported. It could look like this (note that --query expects Extended JSON, so the date is written with $date):

mongoexport --uri "..." --db=yourDatabase --collection=collection --query='{"timestamp": {"$gt": {"$date": "2022-01-01T00:00:00Z"}}}' | mongoimport --uri "..." --db=yourDatabase --collection=collection --numInsertionWorkers=10

Use the --numInsertionWorkers parameter to run multiple insert workers in parallel; it will speed up your inserts.

Do you run a sharded cluster? If yes, you should pre-split the new collection with sh.splitAt(), see How to copy a collection from one database to another in MongoDB
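A rough pymongo equivalent of sh.shardCollection()/sh.splitAt() for the new collection could look like this (shard key and split point are only illustrative):

```python
# Sketch: shard and pre-split the new collection (pymongo).
# Shard key and split point are illustrative placeholders.
from datetime import datetime, timezone

from pymongo import MongoClient

admin = MongoClient("mongodb://mongos-host:27017").admin

# Shard the new (still empty) collection on the same key as the old one.
admin.command("shardCollection", "yourDatabase.collection", key={"timestamp": 1})

# Pre-split so the imported data is spread over the shards instead of
# landing in a single chunk (what sh.splitAt() does in the shell).
admin.command("split", "yourDatabase.collection",
              middle={"timestamp": datetime(2022, 6, 1, tzinfo=timezone.utc)})
```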

Wernfried Domscheit
  • Is there a way to calculate the number of processes/workers to allocate for this task and the time it will take? – isnvi23h4 Mar 21 '22 at 11:04
  • Maybe derive it from the number of CPUs, let's say 5 times the number of CPUs. – Wernfried Domscheit Mar 21 '22 at 11:08
  • Seems like a great solution, however there is one downside: if the ssh command/pipe fails or gets aborted for some reason, the copy will be incomplete. Is there a suggestion for that? Can we make it a background process that runs until completion? – isnvi23h4 Mar 21 '22 at 11:15
  • As with any normal command, i.e. run it in the background by appending `&` and optionally using `nohup`. Make sure to capture the output, which shows the progress. If it fails, simply run it again; already existing documents will fail to insert because `_id` must be unique. – Wernfried Domscheit Mar 21 '22 at 12:07