20

I'm documenting about the GridFS and the possibility to shard it among different machines.

Reading the documentation here, the suggested shard key is chunks.files_id. This key will be linked to the _id of the files collection, thus this _id is incremental. Every new file i save in the Grid will have a new incremental _id.

In the O'Reilly "Scaling MongoDB" book the use of an incremental shard key is discouraged to avoid HotSpots (the last shard will receive all the write and read).

what is your suggestion for sharding the GridFS collection?
have anybody experienced the HotSpot problem?

thank you.

ALoR
  • 4,904
  • 2
  • 23
  • 25

3 Answers3

17

You should shard on files_id to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _ids in the fs.files collection (probably MD5s would be better than ObjectIds).

We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.

kris
  • 23,024
  • 10
  • 70
  • 79
5

You can shard gridfs data because gridfs it just two collecttions: chunks and files. And gridfs sharding it's very useful and great thing. About gridfs shard key it's always bad choose random or incremental shard key, because data not evenly distribute across shards. In case of incremental shard key all writes going to the last shard and it growth and once difference between become 10 or more chunks, balancer move data to another shards. Moving data to another shard always difficult task that should be avoided as it possible.
So when you choose shard key you should care about even distribution of data.
Also if you get luck mb author of 'Scaling MongoDB' kristina(great specialist in shard keys) will answer to your question.
Documentation says that in common cases you should choose default index fileId:1,n:1 as shard key:

There are different ways that GridFS can be sharded, depending on the need. One common way to shard, based on pre-existing indexes, is:

"files" collection is not sharded. All file records will live in 1 shard. It is highly recommended to make that shard very resilient (at least 3 node replica set) "chunks" collection gets sharded using the existing index "files_id: 1, n: 1". Some files at the end of ranges may have their chunks split across shards, but most files will be fully contained within the same shard.

Community
  • 1
  • 1
Andrew Orsich
  • 52,935
  • 16
  • 139
  • 134
  • i thought about the _filename_, but it is in the _files_ collections and not in the _chunks_ which is the collections that needs sharding. – ALoR Mar 17 '11 at 22:01
  • i thought so ;) But files small collection and it will live at one shard. And i see only two shard keys for gridfs: fileId and fileId,n. – Andrew Orsich Mar 17 '11 at 22:05
0

Currently MongoDB as of version 1.8.1 supports only sharding on "file_id" field, because of using md5 to verify the upload, but it doesn't work across shards yet. So you cannot split single file across shards. Answer on google group7

Yaro Nosa
  • 372
  • 3
  • 8