4

The RethinkDB documentation at http://www.rethinkdb.com/docs/data-modeling/ states:

Because of the previous limitation, it's best to keep the size of the posts array to no more than a few hundred documents.

I intend to keep 90 days (3 months) of statistics, and each date will likely have an embedded array of around 10 regions. That means 90*10=900 embedded entries, and 900 isn't exactly a few hundred.

However, a related question, "MongoDB relationships: embed or reference?", suggests that MongoDB has a limit of 16 MB per document, which translates to being able to hold 30 million tweets or roughly 250,000 typical Stack Overflow questions as embedded documents. That's a lot!

However, that is MongoDB. RethinkDB has a limit of 10 MB per document, which should still be considerably high. Either RethinkDB's documentation is flawed, or there is another specific reason (not explained) why RethinkDB suggests keeping it down to a few hundred embedded documents, even though 10 MB can clearly hold a lot more than that.

A rough idea of the schema I was referring to:

DailyStat::Campaign
[
  {
    id: '32141241dkfjhjksdlf',
    days_remaining: 26,
    status: 'running',
    dates: [
      {
        date: 20130926,
        delivered: 1,
        failed: 1,
        clicked: 1,
        top_regions: [
          { region_name: 'Asia', views: 10 },
          { region_name: 'America', views: 10 },
          { region_name: 'Europe', views: 10 },
          { region_name: 'Africa', views: 10 },
          { region_name: 'South East Asia', views: 10 },
          { region_name: 'South America', views: 10 },
          { region_name: 'Northern Europe', views: 10 },
          { region_name: 'Middle East', views: 10 }
        ]
      },
      {
        date: 20130927,
        delivered: 1,
        failed: 1,
        clicked: 1,
        top_regions: [
          { region_name: 'Asia', views: 10 },
          { region_name: 'America', views: 10 },
          { region_name: 'Europe', views: 10 },
          { region_name: 'Africa', views: 10 },
          { region_name: 'South East Asia', views: 10 },
          { region_name: 'South America', views: 10 },
          { region_name: 'Northern Europe', views: 10 },
          { region_name: 'Middle East', views: 10 }
        ]
      },
      ...
    ]
  }
]
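
As a rough sanity check on the size concern, the sketch below builds a document shaped like the schema above with 90 days of 10 regions each and measures its JSON encoding. The region names and dates are placeholders, and JSON size is only an approximation of what RethinkDB actually stores, so treat the number as an order-of-magnitude estimate rather than a measurement.

```python
import json

def build_campaign(days=90, regions_per_day=10):
    """Build a campaign document shaped like the schema above (placeholder values)."""
    return {
        "id": "32141241dkfjhjksdlf",
        "days_remaining": 26,
        "status": "running",
        "dates": [
            {
                "date": 20130926 + d,  # placeholder integer, not a real calendar date
                "delivered": 1,
                "failed": 1,
                "clicked": 1,
                "top_regions": [
                    {"region_name": f"Region {i}", "views": 10}
                    for i in range(regions_per_day)
                ],
            }
            for d in range(days)
        ],
    }

doc = build_campaign()
total_regions = sum(len(day["top_regions"]) for day in doc["dates"])
size_bytes = len(json.dumps(doc).encode("utf-8"))
print(f"{total_regions} embedded regions, about {size_bytes} bytes as JSON")
```

Under these assumptions the whole document is far below a 10 MB limit, which suggests the "few hundred" guidance is not about the hard size cap.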
Christian Fazzini

2 Answers

5

Short answer:

That post is referring to the size of each embedded array, not the sum of their sizes. So in your case the size is only 10, which will certainly be fine.

Longer answer:

The problem with having a large nested array in a document (really just a large document in general; there's nothing special about arrays) is that it makes the document slow to update. RethinkDB doesn't do partial updates right now, so any time you want to update the document, it has to read the entire thing off disk and write the entire thing back to disk. Similarly, this can be a problem if you frequently read a document but only care about a very small part of it. If, for example, you have a very large array in a document but also a small field you need to read very often, then every time you read the small field you'll pay the penalty of reading the large array.
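
To make the update cost concrete, here is a minimal Python sketch that does not talk to RethinkDB at all. It uses the JSON-encoded size as a rough stand-in for the bytes that have to be rewritten, and compares incrementing one counter in an embedded 90-day document against incrementing it in a standalone per-day row. The helper names and the `campaign_id` field are hypothetical.

```python
import json

def make_day(date, regions=10):
    """One per-day stats entry; the values are placeholders."""
    return {
        "date": date,
        "delivered": 1,
        "failed": 1,
        "clicked": 1,
        "top_regions": [
            {"region_name": f"Region {i}", "views": 10} for i in range(regions)
        ],
    }

def serialized_size(value):
    """Bytes of the JSON encoding, used here as a rough proxy for I/O cost."""
    return len(json.dumps(value).encode("utf-8"))

# Embedded layout: all 90 days live inside one campaign document.
campaign = {"id": "32141241dkfjhjksdlf", "dates": [make_day(d) for d in range(90)]}
campaign["dates"][0]["clicked"] += 1       # change a single counter...
embedded_cost = serialized_size(campaign)  # ...but the whole document is rewritten

# Separate layout (hypothetical): each day is its own row pointing at the campaign.
day_row = dict(make_day(0), campaign_id=campaign["id"])
day_row["clicked"] += 1
separate_cost = serialized_size(day_row)   # only this one row is rewritten

print(embedded_cost, separate_cost)
```

The absolute numbers are not meaningful, but the ratio shows why update frequency matters more than the raw document size.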

Joe Doliner
  • Keep in mind: a campaign has 90 days (an embedded array), and each day can have up to 10 regions (an embedded array inside each day). That's 90*10 = 900, which is the total size of the arrays in a campaign document, not 10. – Christian Fazzini Sep 27 '13 at 01:39
  • Ahh, with 900 total this might get to be inefficient. It depends on how frequently you update the values in the document. – Joe Doliner Sep 27 '13 at 21:17
1

The "previous limitation" mentioned here refers to the following:

Disadvantages of using embedded arrays: Any operation on an author document requires loading all posts into memory. Any update to the document requires rewriting the full array to disk.

This is less a hard limitation and more a performance trade-off.

For example, if you are embedding each user's tweets in the user table, you may run into performance issues, because:

  1. the embedding of tweets makes a user document large
  2. every time you insert a tweet, you have to update the whole user document (which is large)
  3. there are probably many insertions of tweets per user per day
  4. multiply this by total number of users

On the other hand, if you store tweets in a separate table, each insertion is small and cheap.
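
Applied to the campaign schema from the question, the separate-table layout could look roughly like the sketch below. It is plain Python that only reshapes the data; the function name `split_campaign` and the `campaign_id` field are hypothetical, and no RethinkDB calls are involved.

```python
def split_campaign(doc):
    """Split an embedded campaign document into one campaign row plus per-day rows."""
    campaign_row = {k: v for k, v in doc.items() if k != "dates"}
    day_rows = [
        {"campaign_id": doc["id"], **day}  # campaign_id is a hypothetical foreign key
        for day in doc.get("dates", [])
    ]
    return campaign_row, day_rows

embedded = {
    "id": "32141241dkfjhjksdlf",
    "days_remaining": 26,
    "status": "running",
    "dates": [
        {"date": 20130926, "delivered": 1, "failed": 1, "clicked": 1,
         "top_regions": [{"region_name": "Asia", "views": 10}]},
        {"date": 20130927, "delivered": 1, "failed": 1, "clicked": 1,
         "top_regions": [{"region_name": "Europe", "views": 10}]},
    ],
}

campaign_row, day_rows = split_campaign(embedded)
print(campaign_row)
print(day_rows)
```

Each per-day row can then be inserted or updated on its own, at the price of needing a second lookup (or a join) to reassemble a full campaign.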

In your instance, you are storing stats on a daily basis. Updating one single document a few times a day shouldn't cause any performance problem, even if it's a few MB.

jackbean818