I'm rebuilding my website, which is a search engine for nicknames from the most active forum in France: you search for a nickname and you get all of its messages.

My current database contains more than 60 GB of data, stored in a MySQL database. I'm now rewriting it into a MongoDB database, and after importing 1 million messages (1 message = 1 document), find() started to take a while.

The structure of a document is as follows:

{
  "_id" : ObjectId(),
  "message": "<p>Hai guys</p>",
  "pseudo" : "mahnickname", //from a nickname (*pseudo* in my db)
  "ancre" : "774497928", //its id in the forum
  "datepost" : "30/11/2015 20:57:44"
}

I set the ancre field as unique, so I don't get the same entry twice.
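For reference, here is a simplified sketch of what the Mongoose schema could look like (variable names and field types here are approximate, not taken from my actual code):

var mongoose = require('mongoose');

var messageSchema = new mongoose.Schema({
  message:  String,                          // HTML body of the post
  pseudo:   String,                          // the nickname
  ancre:    { type: String, unique: true },  // forum id, enforced as unique
  datepost: String                           // e.g. "30/11/2015 20:57:44"
});

var Model = mongoose.model('Message', messageSchema);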

Then the user enters a nickname and the site finds all documents that have that nickname.

Here is the query:

Model.find({pseudo: "danickname"}).sort('-datepost').skip((r_page -1) * 20).limit(20).exec(function(err, bears)...

Should I structure it differently? Instead of having one document for each message, should I have one document per nickname and update that document each time I get a new message from that nickname?

I was using the first method with MySQL and it wasn't taking that long.

Edit: Or should I maybe just index the nicknames (pseudo)?

Thanks!

1 Answer

Here are some recommendations for your big data problem:

  1. The ObjectId already contains a timestamp, and you can sort on it. You could save some disk space by removing the datepost field.
  2. Do you absolutely need the ancre field? The ObjectId is already unique and indexed. If you really need ancre and want to keep datepost separate as well, you could make ancre your _id field instead.
  3. As many have mentioned, you should add an index on pseudo. This will make the "get all messages where pseudo is mahnickname" query much faster (see the index sketch after this list).
  4. If the number of messages per user is low, you could store all of them inside a single document per user. This would avoid having to skip to a specific page, which can be slow. However, be aware of the 16 MB document size limit. I would personally still keep them in separate documents.
  5. To keep queries fast, make sure all of your indexes fit in RAM. You can see the size of each index by running db.collection.stats() and looking at the indexSizes sub-document.
  6. Would there be a way for you to avoid skipping documents altogether and instead page by the time a message was written to the database? If so, use the datepost field or the timestamp embedded in _id for your paging strategy (see the paging sketch after this list). If you decide to use datepost, create a compound index on pseudo and datepost.
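For points 3, 5 and 6, here is a rough sketch of how you could create and inspect those indexes; the collection name messages and the schema variable messageSchema are assumptions on my part, not something taken from your code:

// In the mongo shell (collection name assumed):
db.messages.createIndex({ pseudo: 1 });                // point 3: fast lookups by nickname
db.messages.createIndex({ pseudo: 1, datepost: -1 });  // point 6: serves the filter and the sort together

// Point 5: check how much RAM each index takes.
db.messages.stats().indexSizes;

// Or declare the compound index on the Mongoose schema instead:
messageSchema.index({ pseudo: 1, datepost: -1 });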
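And for point 6, here is a rough sketch of range-based paging on the _id timestamp instead of skip(); the getPage function and the lastId parameter are made up for illustration. Paging on _id also sidesteps the fact that a string like "30/11/2015 20:57:44" sorts lexicographically rather than chronologically:

// Hypothetical range-based ("seek") pagination: instead of skipping documents,
// the client passes back the last _id it saw and we fetch the next 20 older ones.
function getPage(pseudo, lastId, cb) {
  var query = { pseudo: pseudo };
  if (lastId) {
    query._id = { $lt: lastId };   // only documents inserted before the previous page
  }
  Model.find(query)
    .sort('-_id')                  // newest first; ObjectId order roughly matches insertion order
    .limit(20)
    .exec(cb);
}

If you go this route, a compound index on { pseudo: 1, _id: -1 } lets MongoDB answer the equality filter, the range condition and the sort from a single index.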

As for your benchmarks, you can closely monitor MongoDB by using mongotop and mongostat.
