I'm rebuilding my website, which is a search engine for nicknames from the most active forum in France: you search for a nickname and you get all of its messages.

My current database contains more than 60 GB of data, stored in a MySQL database. I'm now rewriting it into a MongoDB database, and after importing 1 million messages (1 message = 1 document), find() started to take a while.

The structure of a document is as follows:

{
  "_id" : ObjectId(),
  "message": "<p>Hai guys</p>",
  "pseudo" : "mahnickname", //from a nickname (*pseudo* in my db)
  "ancre" : "774497928", //its id in the forum
  "datepost" : "30/11/2015 20:57:44"
}

I set the ancre field as unique, so I don't get the same entry twice.
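For reference, here is a simplified sketch of what the Mongoose schema could look like (variable names and field types here are approximate, not taken from my actual code):

var mongoose = require('mongoose');

var messageSchema = new mongoose.Schema({
  message:  String,                          // HTML body of the post
  pseudo:   String,                          // the nickname
  ancre:    { type: String, unique: true },  // forum id, enforced as unique
  datepost: String                           // e.g. "30/11/2015 20:57:44"
});

var Model = mongoose.model('Message', messageSchema);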

Then the user enters a nickname and the site finds all documents that have that nickname.

Here is the query:

Model.find({pseudo: "danickname"}).sort('-datepost').skip((r_page -1) * 20).limit(20).exec(function(err, bears)...

Should I structure it differently? Instead of having one document for each message, should I have one document per nickname and update that document each time I get a new message from that nickname?

I was using the first method with MySQL and it wasn't taking that long.

Edit: Or should I maybe just index the nicknames (pseudo)?

Thanks!

1 Answer

Here are some recommendations for your big data problem:

  1. The ObjectId already contains a timestamp, and you can sort on it. You could save some disk space by removing the datepost field.
  2. Do you absolutely need the ancre field? The ObjectId is already unique and indexed. If you really need ancre and want to keep datepost separate as well, you could make ancre your _id field instead.
  3. As many have mentioned, you should add an index on pseudo. This will make the "get all messages where pseudo is mahnickname" query much faster (see the index sketch after this list).
  4. If the number of messages per user is low, you could store all of them inside a single document per user. This would avoid having to skip to a specific page, which can be slow. However, be aware of the 16 MB document size limit. I would personally still keep them in separate documents.
  5. To keep queries fast, make sure all of your indexes fit in RAM. You can see the size of each index by running db.collection.stats() and looking at the indexSizes sub-document.
  6. Would there be a way for you to avoid skipping documents altogether and instead page by the time a message was written to the database? If so, use the datepost field or the timestamp embedded in _id for your paging strategy (see the paging sketch after this list). If you decide to use datepost, create a compound index on pseudo and datepost.
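For points 3, 5 and 6, here is a rough sketch of how you could create and inspect those indexes; the collection name messages and the schema variable messageSchema are assumptions on my part, not something taken from your code:

// In the mongo shell (collection name assumed):
db.messages.createIndex({ pseudo: 1 });                // point 3: fast lookups by nickname
db.messages.createIndex({ pseudo: 1, datepost: -1 });  // point 6: serves the filter and the sort together

// Point 5: check how much RAM each index takes.
db.messages.stats().indexSizes;

// Or declare the compound index on the Mongoose schema instead:
messageSchema.index({ pseudo: 1, datepost: -1 });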
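And for point 6, here is a rough sketch of range-based paging on the _id timestamp instead of skip(); the getPage function and the lastId parameter are made up for illustration. Paging on _id also sidesteps the fact that a string like "30/11/2015 20:57:44" sorts lexicographically rather than chronologically:

// Hypothetical range-based ("seek") pagination: instead of skipping documents,
// the client passes back the last _id it saw and we fetch the next 20 older ones.
function getPage(pseudo, lastId, cb) {
  var query = { pseudo: pseudo };
  if (lastId) {
    query._id = { $lt: lastId };   // only documents inserted before the previous page
  }
  Model.find(query)
    .sort('-_id')                  // newest first; ObjectId order roughly matches insertion order
    .limit(20)
    .exec(cb);
}

If you go this route, a compound index on { pseudo: 1, _id: -1 } lets MongoDB answer the equality filter, the range condition and the sort from a single index.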

As for your benchmarks, you can closely monitor MongoDB by using mongotop and mongostat.
