3

We have a replica set of 1 primary, 1 secondary, and 1 arbiter. We delete collections often, so I am looking for a fast way to reclaim the disk space used by deleted collections, with no downtime; the current database size is close to 3 TB. I've been researching various ways of doing this, and two common approaches are:

  1. repairDatabase(): this needs free space equal to the size of the used space in order to run. It takes the primary offline and then starts an initial sync on the secondary, which is a very lengthy process. During it, only one node is available: read-only from the secondary while repairDatabase runs, and read/write during the initial sync.

  2. Run an initial sync on a new node, then promote it to primary and retire the old one; repeat the process for the secondary. With this option both primary and secondary remain available, but it is a very lengthy process, taking almost a week to run the initial sync twice.
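For context, the two approaches above look roughly like this from the mongo shell. This is a sketch only: the hostnames and the database name `mydb` are placeholders, and both steps require a live replica set.

```javascript
// Approach 1: repair in place (needs free disk >= current data size).
rs.stepDown()                              // on the current primary, force failover
db.getSiblingDB("mydb").repairDatabase()   // then repair the stepped-down node

// Approach 2: sync a fresh node, then retire an old one.
rs.add("newnode.example.com:27017")        // triggers an initial sync on the new member
rs.status()                                // wait until the new member reaches SECONDARY
rs.remove("oldnode.example.com:27017")     // then remove the old member
```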

Is there a better solution to reclaim disk space on a regular basis, and relatively faster than the above?

Note that every single collection is subject to deletion.

Thanks

Kamal
  • New attempt at a better link ;-) http://docs.mongodb.org/manual/reference/command/compact/ ? – Joachim Isaksson Jun 09 '14 at 17:53
  • This will maybe sound stupid, but why do you want to reclaim disk space? – Christian P Jun 09 '14 at 18:00
  • db is growing daily, and we do a lot of delete, if we don't reclaim this space we end up with a high storage consumption rate. – Kamal Jun 09 '14 at 18:05
  • @Kamal - what's your network speed between replica set members? – Christian P Jun 09 '14 at 18:14
  • @ChristianP roughly 115 MB/s – Kamal Jun 09 '14 at 19:07
  • Why do you think that MongoDB does not reuse disk space? It does - you cannot reclaim the space without taking one of the actions you outlined, but internally it will mark that space as deleted after you drop a collection and will re-use it - I see no reason why disk space usage would continue to grow. – John Petrone Jun 10 '14 at 05:17
  • @John, my understanding is that MongoDB re-uses disk space from deleted collections only under certain conditions. Yes, the space will eventually be reused, but the rate of adding new collections is higher than the reuse rate, so disk usage will continue to grow: http://stackoverflow.com/questions/13390160/does-mongodb-reuse-deleted-space – Kamal Jun 10 '14 at 15:16
  • MongoDB will generally re-use disk space, unless it cannot for some reason (like the open slots being too small). As the answer you linked to discusses, there are plenty of tools in MongoDB to make certain your initial document storage allocation is high enough for a high probability of re-use after deletion. While a resync now and again is good for compaction (like a periodic defrag on a disk), properly managed MongoDB should do a pretty good job of managing disk space for you. – John Petrone Jun 11 '14 at 17:54

2 Answers


There's no easy way to achieve this unless you design your DB structure to keep different collections in different databases, which in turn means storing them in different paths on your HDD, as long as you have directoryPerDB set to true in your mongod.conf. This is a workaround, and depending on your app it might be impractical.
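As a sketch, assuming the classic (pre-YAML) config format, the relevant setting looks like this; the dbpath value is a placeholder:

```
# mongod.conf excerpt: store each database in its own subdirectory
# under dbpath, so dropping an entire database returns its files to the OS
dbpath = /var/lib/mongodb
directoryperdb = true
```

Note that this only helps when a whole database is dropped; dropping a single collection inside a database still leaves that database's files allocated on disk.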

While it's true that dropping a collection won't free the HDD space, it's also true that the used space is not lost: it will eventually be reused for new collections.

That being said, unless you are really short on space, don't reclaim it. The CPU and I/O cost of doing so regularly is far more expensive than the cost of storage capacity with every provider I know of.

ffflabs

I'd take a look at using MongoDB's sharding functionality to address some of your issues. To quote from the documentation:

Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

While sharding is frequently used to balance throughput for a large collection across more servers, avoiding hot spots and spreading the overall load, it's also useful for managing storage for large collections. In your specific case I'd investigate the use of shard tags to pin a collection to a specific shard.

Again, to quote the documentation, shard tags are useful to

isolate a specific subset of data on a specific set of shards.

For example, let's say you split your production environment into a couple of shards, shard1 and shard2. Using shard tags and the sharding tools, you could pin the collections that you frequently delete onto shard2, while shard1 holds all your normal collections. When you then choose to reclaim disk storage via your second option, you'd perform it only on the shard that holds the deleted collections - that way you avoid having to re-copy the more static data. It should run faster that way (how much faster is a function of how much data is in the deleted-collections shard at any given time).
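A minimal sketch of that tag-based pinning from the mongos shell; the shard name, tag, and namespace `mydb.tempdata` are placeholders for your own setup:

```javascript
// Pin a frequently-deleted collection to shard2 via a tag.
sh.addShardTag("shard2", "droppable")            // tag the target shard
sh.enableSharding("mydb")                        // enable sharding on the database
sh.shardCollection("mydb.tempdata", { _id: 1 })  // shard the collection on _id
// Route the collection's entire key range to the tagged shard:
sh.addTagRange("mydb.tempdata",
               { _id: MinKey }, { _id: MaxKey },
               "droppable")
```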

It also has the secondary benefit that each shard (actually, the replica set within each shard) requires smaller servers, as it only contains a subset of the overall data.

The specifics of the best way to do this will be driven by your exact use case - number and size of collections, insert, update, query and deletion frequency, etc. I described a simple 2 shard case but you can do this with many more shards. You can also have some shards running on higher performance hardware for collections that have more transaction volume.

I can't really do sharding justice in the limited space here, other than to point you in the right direction to investigate it. MongoDB has a lot of good information in their documentation, and their 2 online DBA courses (which are free) get into this in some detail.

Some useful links:

http://docs.mongodb.org/manual/core/sharding-introduction/

http://docs.mongodb.org/manual/core/tag-aware-sharding/

John Petrone
  • Thanks for the answer. I agree with the many solutions sharding provides for large data sets. However, for reclaiming disk space in our case I don't see a solution, as every single collection is subject to deletion, which means running the reclaim process on every shard (regardless of what that process is or how it works). It will run faster on each shard, but the total is likely to be the same as running it once on a single replica set with a large data set. – Kamal Jun 10 '14 at 04:53
  • I'd suggest you add the "every single collection is subject to deletion" detail to the question, as that will help others craft another solution. – John Petrone Jun 10 '14 at 05:14