
I am storing documents in a NoSQL (MongoDB or similar) datastore in JSON format like so:


* edit start *


{
    _id : 9182798172981729871,
    propertyBBBB : [
       {
           propertyCCCCC : "valueCCCC",
           propertyDDDDD : [ "valueDDDD", "valueEEEE", "valueFFFF" ]
       }, {
           propertyCCCCC : "valueGGGG",
           propertyDDDDD : [ "valueHHHH", "valueIIII", "valueFFFF" ]
       }
       ....
    ]
}


.find( { _id : 9182798172981729871 },
       { propertyBBBB : { "$elemMatch" : { propertyDDDDD : { "$in" : [ "refineaquerystringvar" ] } } } } )

MongoDB nested array query


**** edit end ****


Currently I am querying by _id and I perform logic on the nested array after the fetch has returned the document.
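The post-fetch logic described above might look something like this in plain JavaScript (a minimal sketch; the document shape and `refineaquerystringvar` come from the question, while the `refine` helper name is hypothetical):

```javascript
// Sketch: after fetching the parent document by _id, refine the
// embedded array in application code (field names from the question).
const doc = {
  _id: "9182798172981729871",
  propertyBBBB: [
    { propertyCCCCC: "valueCCCC", propertyDDDDD: ["valueDDDD", "valueEEEE", "valueFFFF"] },
    { propertyCCCCC: "valueGGGG", propertyDDDDD: ["valueHHHH", "valueIIII", "valueFFFF"] }
  ]
};

// Hypothetical helper: keep only the array elements whose
// propertyDDDDD tag list contains the refinement string.
function refine(doc, tag) {
  return doc.propertyBBBB.filter(el => el.propertyDDDDD.includes(tag));
}

console.log(refine(doc, "valueHHHH")); // the elements tagged "valueHHHH"
```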

But I am looking for more flexibility in querying, so I am thinking about making a new NoSQL (MongoDB or similar) collection full of documents that look like the value of propertyBBBB:


* edit start *


   {
       _id: 9234792837498237498237498,
       parentid: 9182798172981729871,
       propertyCCCCC: "valueCCCC",
       propertyDDDDD: [ "valueDDDD", "valueEEEE", "valueFFFF" ]
   }

   {
       _id: 9234792837498237498237497,
       parentid: 9182798172981729871,
       propertyCCCCC: "valueCCCC",
       propertyDDDDD: [ "valueDDDD", "valueEEEE", "valueFFFF" ]
   }


.find( { parentid : 9182798172981729871,
         propertyDDDDD : { "$in" : [ "refineaquerystringvar" ] } } )

MongoDB nested array query


**** edit end ****
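In the normalized model the refinement becomes part of the query itself rather than post-fetch logic. A rough in-memory sketch of what that find is meant to select (field names from the question; the `collection` array is a stand-in for the real collection, and the third document is an invented non-matching example):

```javascript
// Stand-in for the proposed new collection of propertyBBBB-shaped documents.
const collection = [
  { _id: "9234792837498237498237498", parentid: "9182798172981729871",
    propertyCCCCC: "valueCCCC", propertyDDDDD: ["valueDDDD", "valueEEEE", "valueFFFF"] },
  { _id: "9234792837498237498237497", parentid: "9182798172981729871",
    propertyCCCCC: "valueCCCC", propertyDDDDD: ["valueDDDD", "valueEEEE", "valueFFFF"] },
  { _id: "9234792837498237498237496", parentid: "0000000000000000000",
    propertyCCCCC: "valueZZZZ", propertyDDDDD: ["valueYYYY"] }
];

// Emulates find({ parentid: ..., propertyDDDDD: { $in: [tag] } }):
// select the documents sharing the parent _id whose tag array contains tag.
function findByParentAndTag(coll, parentid, tag) {
  return coll.filter(d => d.parentid === parentid && d.propertyDDDDD.includes(tag));
}

console.log(findByParentAndTag(collection, "9182798172981729871", "valueEEEE").length); // 2
```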


But I don't want to lose my query speed, because this approach uses parentid as a complementary query parameter instead of the main fetch, and it fetches many documents instead of guaranteeing exactly one every time.

So my question is:

At what point is it better to query Mongo by property instead of storing a large array inside a document and querying by that document's _id? How long would the array (or the returned query set) have to be to make one convention more advantageous than the other?

Benjamin McFerren
  • What is the actual problem you are trying to solve? Using nested documents simplifies your app logic and the number of roundtrips, and data in the substructure can be indexed -- so what precisely are you trying to do by normalizing the data? – Soren May 27 '16 at 11:31
  • The problem I am trying to solve is fetching a query set of document objs that are refined by a tagname string and each share the same parent _id. If instead each item in the array was its own document in a separate collection, then we could query that collection by tagname string and the parent _id that brings them together. But that query takes time depending on how many documents in that collection meet the criteria and also how many total documents are in the supposed new collection. – Benjamin McFerren May 27 '16 at 14:30
  • At what point does that weight (i.e. time taken to query because of these variables I describe above) make it more advantageous to just list each obj in an array that is found in the parent's document (and then just refine by tagname on that array)? – Benjamin McFerren May 27 '16 at 14:30
  • You should almost always use an embedded array and query against that. Trying to normalize data is almost always a mongo-beginners mistake. Could you update your question with a query (code) you intend to run, maybe that will help clarify your question – Soren May 27 '16 at 18:38
  • .. like what kind of logic do you perform on the nested array after you query? – Soren May 27 '16 at 18:57
  • please see my attempt at these queries in my edits above – Benjamin McFerren May 27 '16 at 21:39

2 Answers


The answer really depends on the use case for your data and what you expect to retrieve in your query. Things to notice are:

  1. MongoDB does not do joins, so anything where you have to glue data back together requires extra logic in your application, and extra CPU power to do so -- more, smaller records may not speed up your application, and in fact most people find that a normalized data schema performs much worse than a denormalized one.

  2. MongoDB does not support documents over 16MB -- so if your array structure can grow without bound you may have a problem -- for example, it would be a bad design to keep an array of all users of your application.

You already use the $elemMatch directive in your query, which is good as it reduces what is transferred over the network to what is actually needed. Very large documents could still be a problem for disk IO, but in many Mongo databases the active dataset fits entirely in memory, so IO matters much less, assuming the majority of operations are reads. If writes (updates) are the majority of operations, then it is worth considering that updating just one element in an array causes the entire document to be rewritten to the database, so if the document is very large then changing just one byte results in significant IO -- collecting user events in a session is one use case where appending events to an array may end up being a bad design.
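A tiny sketch of the data-size point above: an $elemMatch projection returns only the first matching array element rather than the whole array. Emulated in plain JavaScript (an approximation of the server-side behaviour, not the driver API -- the real projection omits the field entirely when nothing matches):

```javascript
// Approximates the effect of projecting with
// { propertyBBBB: { $elemMatch: { propertyDDDDD: { $in: [tag] } } } }:
// only the FIRST matching element of the embedded array is returned.
function projectElemMatch(doc, tag) {
  const match = doc.propertyBBBB.find(el => el.propertyDDDDD.includes(tag));
  return { _id: doc._id, propertyBBBB: match ? [match] : [] };
}

const doc = {
  _id: "9182798172981729871",
  propertyBBBB: [
    { propertyCCCCC: "valueCCCC", propertyDDDDD: ["valueDDDD", "valueFFFF"] },
    { propertyCCCCC: "valueGGGG", propertyDDDDD: ["valueHHHH", "valueFFFF"] }
  ]
};

console.log(projectElemMatch(doc, "valueFFFF").propertyBBBB.length); // 1
```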

If your find in the denormalized model could return multiple records (it won't in your case, since you query by _id), the client-side application logic needed in a normalized schema to stitch the records back together could get very hairy, and is something you would probably want to avoid.

The only benefit I can think of in the normalized model is if you have a large number of Mongo shards and you expect the find to return a large number of records, since you could parallelize the retrieval of data across multiple hosts; however, the amount of data returned by each find would have to be very large before you noticed a difference.

So, in summary, I think the circumstances where you would want to normalize your data for performance reasons are rare at best for most people. If you have a good understanding of your data you may want to run a benchmark test, but unless you see a substantial (2x or 3x) difference I would still go with the denormalized model, simply for the ease and simplicity of the code you have to write.

As you are asking for "official sources", I can refer to the MongoDB blog, which has a series of write-ups on how to design your data models; they reiterate the same points I have made above, plus a few extra hints.

Soren

Yes, you can load the collection into an array.
A collection is composed of documents, and each document can be mapped to an object.
Finally, you would load the Mongo collection as an array of objects.
I think there is no problem with processing a huge array of objects on a server, all the more so since Node.js and MongoDB are often hosted on the same machine. So the work done processing a huge array in Node balances out the work that would otherwise have been done in Mongo.

kevin ternet