
I read that embedding is better for performance: "If performance is an issue, embed." (http://www.mongodb.org/display/DOCS/Schema+Design), and most guides say that "contains" relationships should be embedded.

However, I am not sure this is always the case. Suppose we have two objects: Blog and Post. A Blog contains Posts.

Embedding all posts in the blog document would raise the following issues:

  1. Paging. Since it's not possible to filter embedded objects, we will always get all posts and need to page through them in the application.
  2. Filtering. Same as before: when searching for a word inside posts, it will not be possible to filter the embedded collection in MongoDB.
  3. Insert. I assume inserting into a collection is faster than inserting into an embedded object. Is this correct? Is this documented anywhere?
  4. Update. Similarly, updating a field in place inside a smaller document (Post) might be faster than updating the post in place inside the big Blog document. Is this correct?

Given all of the above, I would put posts in a separate collection that references Blog. Is this the correct conclusion?

(Note: Please do not factor the document size limit into the response; let's assume each blog will have at most 1000 posts.)
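
For illustration, the two layouts being compared might look roughly like this. It is only a sketch: the field names (`name`, `title`, `body`, `blogId`) and the `_id` values are assumptions, not taken from any real schema.

```javascript
// Option A: posts embedded in the blog document (hypothetical field names).
const embeddedBlog = {
  _id: "blog1",
  name: "My blog",
  posts: [
    { title: "First post", body: "..." },
    { title: "Second post", body: "..." },
  ],
};

// Option B: posts in their own collection, each referencing its blog.
const blog = { _id: "blog1", name: "My blog" };
const posts = [
  { _id: "post1", blogId: "blog1", title: "First post", body: "..." },
  { _id: "post2", blogId: "blog1", title: "Second post", body: "..." },
];

// Both layouts hold the same posts; only the nesting differs.
const titlesA = embeddedBlog.posts.map((p) => p.title);
const titlesB = posts.filter((p) => p.blogId === blog._id).map((p) => p.title);
console.log(titlesA, titlesB);
```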

franzlorenzon
mbdev

3 Answers


1. Paging is possible with the $slice operator:

db.blogs.find({}, {posts:{$slice: [10, 10]}}) // skip 10, limit 10
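
To turn a page number into that projection, a small helper along these lines could be used. This is a sketch: `pageProjection`, `sliceModel` and `blogId` are hypothetical names, and `sliceModel` only imitates the `[skip, limit]` form of `$slice` on a plain array.

```javascript
// Build the projection for one page of embedded posts.
// page is zero-based; pageSize is the number of posts per page.
function pageProjection(page, pageSize) {
  return { posts: { $slice: [page * pageSize, pageSize] } };
}

// Plain-array model of what a [skip, limit] $slice returns.
function sliceModel(array, skip, limit) {
  return array.slice(skip, skip + limit);
}

// In the mongo shell the projection is the second argument to find():
//   db.blogs.find({ _id: blogId }, pageProjection(1, 10))
const projection = pageProjection(1, 10);
console.log(JSON.stringify(projection)); // {"posts":{"$slice":[10,10]}}

// Second page of a blog with 25 posts, numbered 0 to 24.
const postNumbers = Array.from({ length: 25 }, (_, i) => i);
console.log(sliceModel(postNumbers, 10, 10));
```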

2. Filtering is also possible:

db.blogs.find({"posts.title":"Mongodb!"}, {posts:{$slice: 1}}) //take one post
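
One caveat, which also comes up in the comments on this answer: the dot-notation condition selects blog documents, and `$slice: 1` then keeps the first post in array order, which is not necessarily the post that matched. The sketch below models that behaviour with plain arrays (no database involved; the sample blogs are made up).

```javascript
// Plain-JavaScript model of the query above.
const blogs = [
  { name: "A", posts: [{ title: "Hello" }, { title: "Mongodb!" }] },
  { name: "B", posts: [{ title: "Other" }] },
];

// {"posts.title": "Mongodb!"} selects blogs with at least one matching post.
const matchingBlogs = blogs.filter((b) =>
  b.posts.some((p) => p.title === "Mongodb!")
);

// {posts: {$slice: 1}} then keeps the first post of each selected blog,
// regardless of which post caused the match.
const projected = matchingBlogs.map((b) => ({ ...b, posts: b.posts.slice(0, 1) }));
console.log(projected[0].posts[0].title); // prints Hello, not Mongodb!

// Picking out the matching posts has to happen in the application.
const wanted = matchingBlogs.flatMap((b) =>
  b.posts.filter((p) => p.title === "Mongodb!")
);
console.log(wanted.length);
```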

3, 4. I'd guess you are talking about a small performance difference. It's not rocket science; it's just a blog with at most 1000 posts.

You said:

Is this the correct conclusion?

No, not if you care about performance (though in general, if the system will be small, you can go with separate documents).

I ran a small performance test for points 3 and 4; here are the results:

-----------------------------------------------------------------
| Count      | Inserting posts    | Adding to nested collection |
|------------|--------------------|-----------------------------|
| 1          | 1 ms               | 28 ms                       |
| 1000       | 81 ms              | 590 ms                      |
| 10000      | 759 ms             | 2723 ms                     |
-----------------------------------------------------------------
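
To read the table as ratios, the arithmetic can be done like this. The numbers are copied from the table; nothing else is measured here, and the variable names are made up for the example.

```javascript
// Timings copied from the table above (milliseconds).
const results = [
  { count: 1, insertMs: 1, nestedMs: 28 },
  { count: 1000, insertMs: 81, nestedMs: 590 },
  { count: 10000, insertMs: 759, nestedMs: 2723 },
];

// How many times slower the nested write was at each size,
// and the average nested-write time per post.
const summary = results.map((r) => ({
  count: r.count,
  slowdown: r.nestedMs / r.insertMs,
  nestedMsPerPost: r.nestedMs / r.count,
}));

for (const s of summary) {
  console.log(
    s.count,
    s.slowdown.toFixed(1) + "x",
    s.nestedMsPerPost.toFixed(3) + " ms/post"
  );
}
```

In these figures the gap is largest for a single write (28x) and narrows to roughly 7x and 3.6x for the larger batches.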
Andrew Orsich
  • Are you sure #2 returns the blog with only the post that matches the title? I would think it returns the blog containing a post titled "Mongodb!", and then $slice just takes the first post. So you will get the wrong post. – mbdev Jun 16 '11 at 15:02
  • See here: http://stackoverflow.com/questions/2138454/filtering-embedded-documents-in-mongodb – mbdev Jun 16 '11 at 15:03
  • @mbdev: #2 is just a fake query. I was just showing you how it can be done. – Andrew Orsich Jun 16 '11 at 16:23
  • @Andrew have you checked the link I posted? – mbdev Jun 19 '11 at 06:42
  • @Andrew see also here: http://codefudging.posterous.com/understanding-embedded-documents-in-mongodb – mbdev Jun 19 '11 at 06:45
  • @mbdev: I see what you mean. Point #2 will return the first post in 'natural' order, which can certainly be a post with a different title. Dot notation is a way to query on root documents, so you need to load the blog and search on the client side ;(. $slice is a good fit if you know the query will return one blog; then you can page through the nested collection in natural order. – Andrew Orsich Jun 19 '11 at 08:18
  • So would that change your conclusion when filtering is needed (and in the case of blog and posts, most operations are done on posts) that a separate collection will be better? – mbdev Jun 19 '11 at 10:45
  • @Andrew you still around? I would like to accept this answer after the corrections – mbdev Jun 21 '11 at 10:43
  • @mbdev: Okay, I'll do it a little later and let you know. Thanks. – Andrew Orsich Jun 21 '11 at 12:57

As for 3 & 4, if you are inserting into a nested document, it is basically an update.
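
As a rough sketch of that difference, assuming collections named `posts` and `blogs` and a hypothetical `blogId`: a stand-alone post is written with `insert`, while an embedded post is written with an update that uses `$push`. The helpers `pushPostUpdate` and `applyPush` are made up for this example, and `applyPush` only imitates the effect on an in-memory object.

```javascript
// Stand-alone insert: the post is a new document in its own collection.
//   db.posts.insert(post)
// Embedded "insert": an update that grows an existing blog document.
//   db.blogs.update({ _id: blogId }, pushPostUpdate(post))

// Build the update document that appends a post to the embedded array.
function pushPostUpdate(post) {
  return { $push: { posts: post } };
}

// Tiny in-memory model of applying that $push to a blog document.
function applyPush(blog, update) {
  return { ...blog, posts: [...blog.posts, update.$push.posts] };
}

const blog = { _id: "blog1", posts: [{ title: "First" }] };
const grown = applyPush(blog, pushPostUpdate({ title: "Second" }));
console.log(grown.posts.length); // the same document, now larger
```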

This can be very bad for performance: inserts are generally appended to the end of the data, which is fast, whereas updates can be much trickier.

If your update does not change the size of a document (meaning that you had a key/value pair and simply changed the value to a new value that takes up the same amount of space), you will be OK. The problem arises when you start modifying documents and adding new data.

The problem is that while MongoDB allots more space than it needs for each document, it may not be enough. If you insert a document that is 1 KB large, MongoDB may allot 1.5 KB for it so that minor changes have room to grow. If you use more than the allocated space, MongoDB has to fetch the entire document and rewrite it at the tail end of the data.

There is obviously a performance implication in fetching and re-writing the data which will be amplified by the frequency of such an operation. To make matters worse, when this happens you end up leaving holes or pockets of unused space in your data files.

This ultimately gets copied into memory, which means that you may end up using 2 GB of RAM to store your data set while the data itself only takes up 1.5 GB, because there is 0.5 GB worth of pockets. This fragmentation can be avoided by doing inserts as opposed to updates. It can also be fixed by doing a database repair.
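
The effect can be pictured with a toy model. It is not MongoDB's real allocator: the function `simulateGrowth`, the byte sizes, and the rule "allocate size times 1.5 on every move" are all assumptions chosen to mirror the 1 KB / 1.5 KB example.

```javascript
// Toy model of padded allocation and document moves.
function simulateGrowth(initialSize, growthPerUpdate, updates, paddingFactor) {
  let size = initialSize;
  let allocated = Math.ceil(size * paddingFactor);
  let moves = 0;
  let holeBytes = 0;

  for (let i = 0; i < updates; i++) {
    size += growthPerUpdate;
    if (size > allocated) {
      // The document no longer fits: rewrite it at the end of the data
      // and leave its old slot behind as an unused pocket.
      holeBytes += allocated;
      allocated = Math.ceil(size * paddingFactor);
      moves++;
    }
  }
  return { size, allocated, moves, holeBytes };
}

// A 1000-byte blog document that gains a 200-byte post on each update.
console.log(simulateGrowth(1000, 200, 10, 1.5));

// Same-size updates never outgrow the slot, so nothing moves.
console.log(simulateGrowth(1000, 0, 10, 1.5));
```

In this model the growing document is moved twice in ten updates and leaves 3900 bytes of pockets behind, while the same-size updates leave none.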

In the next version of MongoDB there will be an online compaction function.

Bryan Migliorisi
  • You think the numbers will be worse than those by Andrew? – mbdev Jun 16 '11 at 12:50
  • Impossible to say. It depends 100% on your data structure and the size of your documents and embedded documents. Once you try to insert a document larger than the allotted free space, you're going to see write performance drop. I think this will be difficult to prove with a small test on a relatively small data set. – Bryan Migliorisi Jun 16 '11 at 14:19
  1. You can page with '$slice' on the embedded element.
  2. You can search with "field1.field2": /aRegex/, where aRegex is the word you are searching for. But watch out for performance.

As for 3 and 4, I have no data to prove it either way.

BTW, two collections can be easier to code, use, and manage. You can simply store the blogId in each 'post' document and add "blogId":"1234ABCD" to all your queries.
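
A sketch of what that looks like with a separate collection, modeled on a plain array so it runs without a database. The collection name `posts`, the `title` field, the sample data, and the helper `findPosts` are assumptions for illustration.

```javascript
// In-memory stand-in for a separate posts collection (hypothetical fields).
const posts = [
  { blogId: "1234ABCD", title: "Intro to MongoDB" },
  { blogId: "1234ABCD", title: "Schema design notes" },
  { blogId: "9999ZZZZ", title: "Unrelated blog post" },
  { blogId: "1234ABCD", title: "More MongoDB tips" },
];

// Model of the shell query:
//   db.posts.find({ blogId: blogId, title: pattern }).skip(skip).limit(limit)
function findPosts(collection, blogId, pattern, skip, limit) {
  return collection
    .filter((p) => p.blogId === blogId && pattern.test(p.title))
    .slice(skip, skip + limit);
}

// First page of posts in blog 1234ABCD whose title contains "MongoDB".
const page = findPosts(posts, "1234ABCD", /MongoDB/, 0, 10);
console.log(page.map((p) => p.title));
```

Here both the filter and the paging apply to individual posts, which is what the embedded layout could not do on the server.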

Aurélien B