
I've seen many posts on how to do many-to-many relationships with MongoDB, but none of them mention scale. For example, these posts:

MongoDB Many-to-Many Association

How to organise a many to many relationship in MongoDB

The problem I can see with this kind of setup is MongoDB's 16MB document limit. Say I have users, groups, and posts. A post has an associated group and many users that can like it. A group has many posts in it, and many users that can follow it. A user can have many liked posts and can follow many groups. If I were to build this with a relational database, I would set it up like this:

user:
    user_id
    username

post:
    post_id
    group_id
    message

group:
    group_id
    name

post_likes:
    post_id
    liked_user_id

group_followers:
    group_id
    follower_user_id

In theory, a group can have an unlimited number of posts and following users, a post can have an unlimited number of liked users, and a user can have an unlimited number of liked posts and groups that they are following, provided pagination is done correctly in the SQL queries.

How can I set up the schema in MongoDB so that this sort of scale can be achieved?

mverderese
  • After some more looking, seems like GridFS might be the way to go? http://docs.mongodb.org/manual/core/gridfs/ – mverderese Aug 08 '15 at 01:47
  • GridFS is the most horrible solution to deal with this problem. Proper data modeling, as you started to detail it, is waaaay better. – Markus W Mahlberg Aug 08 '15 at 08:27

2 Answers


This is a good question which illustrates the problems with overembedding and how to deal with them.

Example: Post likes

Let's stick with the simple example of users liking posts. The other relations would be handled analogously.

You are absolutely right that storing the likes inside the post would sooner or later cause very popular posts to reach the size limit.

So you correctly fell back to creating a post_likes collection. Why do I call this correct? Because it fits your use cases and your functional and non-functional requirements!

  • It scales indefinitely (well, there is a theoretical limit, but it is humongous)
  • It is easy to maintain (create a unique index over post_id and liked_user_id) and use (both the user and the post are known, so adding a like is a simple insert or, more likely, an upsert)
  • You are able to easily find out which users like a given post and which posts are liked by a given user

However, I would expand the collection a bit to prevent unneeded queries for certain frequent use cases.
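The unique index and the upsert mentioned in the list above could look roughly like this. It is a minimal sketch, assuming `postLikes` is a collection handle such as `db.post_likes` in mongosh. The helper names `ensureLikeIndexes` and `likePost` are made up for illustration, and the second index is an assumption added so that the reverse lookup (all posts liked by one user) is also served by an index.

```javascript
// Hypothetical helpers; `postLikes` is assumed to be a collection handle
// (for example db.post_likes in mongosh).

function ensureLikeIndexes(postLikes) {
  // One document per (post, user) pair: a duplicate like is rejected.
  postLikes.createIndex({ post_id: 1, liked_user_id: 1 }, { unique: true });
  // Assumption: a second index for "which posts does this user like?".
  postLikes.createIndex({ liked_user_id: 1, post_id: 1 });
}

function likePost(postLikes, postId, userId) {
  const pair = { post_id: postId, liked_user_id: userId };
  // Upsert: inserts the pair if it is missing, changes nothing if it exists.
  return postLikes.updateOne(pair, { $setOnInsert: pair }, { upsert: true });
}
```

Because the upsert is idempotent, a user clicking "like" twice still leaves a single document for that pair.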

Let's assume for now that post titles and usernames can't be changed. In that case, the following data model could make more sense:

{
  _id: new ObjectId(),
  "post_id": someValue,
  "post_title": "Cool thing",
  "liked_user_id": someUserId,
  "user_name": "JoeCool"
}

Now let's assume you want to display the username of all users that liked a post. With the model above, that would be a single, rather fast query:

db.post_likes.find(
  {"post_id":someValue},
  {_id:0,user_name:1}
)

With only the IDs stored, this rather common task would need at least two queries and, given the constraint that there can be an unbounded number of likers for a post, potentially huge memory consumption (you'd need to hold the user IDs in RAM).
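Since the number of likers is unbounded, the single query would normally be read one page at a time. The following is a sketch of range-based paging, assuming a mongosh-style collection handle whose `find(query, projection)` returns a cursor with `sort()` and `limit()`, and assuming the unique index over post_id and liked_user_id exists. The helper name `likersPage` and its parameters are hypothetical.

```javascript
// Hypothetical helper; `postLikes` is assumed to be a mongosh-style collection
// handle (for example db.post_likes).
function likersPage(postLikes, postId, afterUserId, pageSize) {
  const query = { post_id: postId };
  if (afterUserId !== null) {
    // Resume right after the last user ID seen on the previous page.
    query.liked_user_id = { $gt: afterUserId };
  }
  return postLikes
    .find(query, { _id: 0, liked_user_id: 1, user_name: 1 })
    .sort({ liked_user_id: 1 }) // stable order that follows the compound index
    .limit(pageSize);
}
```

The first page is requested with `afterUserId` set to `null`. Each later page passes the last `liked_user_id` of the previous one, so only one page of usernames is held in memory at a time.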

Granted, this leads to some redundancy, but even when millions of people like a post, the extra data amounts to megabytes rather than gigabytes of relatively cheap (and easy to scale) disk space, while gaining a lot of performance in terms of user experience.

Now here comes the thing: even if the usernames and post titles are subject to change, you would only have to do a multi update:
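The cost of the redundancy can be estimated from the BSON layout of a string field. This is a back-of-the-envelope sketch: it uses the sample values from the document above, and the figure of one million likes is a hypothetical scale. It counts only the two redundant fields, before any storage-engine compression.

```javascript
// Size in bytes of one BSON string element:
// 1 type byte + key + NUL + 4-byte length + value + NUL.
function bsonStringElementSize(key, value) {
  const bytes = (s) => new TextEncoder().encode(s).length;
  return 1 + bytes(key) + 1 + 4 + bytes(value) + 1;
}

// Redundant bytes per like for the sample document (title and username).
const extraPerLike =
  bsonStringElementSize("post_title", "Cool thing") +
  bsonStringElementSize("user_name", "JoeCool");

// Hypothetical scale: one million likes on a single post.
const extraTotalBytes = extraPerLike * 1e6;
```

For the sample values this works out to 50 bytes per like, or about 50 MB per million likes. Longer titles and usernames raise the figure proportionally.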

db.post_likes.update(
  {"post_id":someId},
  { $set:{ "post_title":newTitle} },
  { multi: true}
)
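The same pattern applies to a username change. Here is a sketch using `updateMany`, which current shells and drivers offer in place of `update` with `multi: true`. It assumes `postLikes` is a collection handle such as `db.post_likes`, and the helper name `renameUserInLikes` is hypothetical.

```javascript
// Hypothetical helper: propagate a username change to every denormalized
// copy in post_likes. `postLikes` is assumed to expose updateMany().
function renameUserInLikes(postLikes, userId, newName) {
  return postLikes.updateMany(
    { liked_user_id: userId },       // every like made by this user
    { $set: { user_name: newName } } // overwrite the stored copy of the name
  );
}
```

For this to stay reasonably fast, the filter on liked_user_id would need its own index, since an index that starts with post_id does not help here.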

You are trading slower handling of rather rare operations, like changing a username or a post title, for extreme speed in use cases which happen extremely often.

Bottom line

Keep in mind that MongoDB is a document-oriented database. So document the events you are interested in, with the values you need for future queries, and model your data accordingly.

Markus W Mahlberg

If you're just storing the IDs of the relationships inside arrays in each collection, you shouldn't have much of a problem within a single document. GridFS can be used, but that's usually more for media such as files, music, and videos. Using GridFS would also make updates a pain.
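How far an array of IDs can go before hitting the 16MB limit can be estimated from the BSON array layout. This is a rough sketch, assuming the IDs are 12-byte ObjectIds and counting only the array itself, not the other fields of the document. The function name is made up for illustration.

```javascript
// Bytes used by a BSON array holding n ObjectIds.
// Each element: 1 type byte + index as decimal string + NUL + 12-byte ObjectId.
function objectIdArrayBytes(n) {
  let total = 4 + 1; // int32 length prefix + terminating NUL of the array
  for (let i = 0; i < n; i++) {
    total += 1 + String(i).length + 1 + 12;
  }
  return total;
}

const LIMIT = 16 * 1024 * 1024; // MongoDB's maximum BSON document size

const halfMillionFits = objectIdArrayBytes(500000) < LIMIT;
const millionFits = objectIdArrayBytes(1000000) < LIMIT;
```

Under these assumptions, half a million IDs fit in one document but a million do not, so the array approach holds only while each relationship stays well below that range.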

ThrowsException