I am seeking to import a large collection of nested JSON objects into a MongoDB database. It is common practice, under certain circumstances, to represent these relationships using referenced collections rather than directly embedded documents.

Here is a concrete example. Suppose I have tens of gigabytes of JSON in the following format, where the children array is occasionally thousands of objects long and each object has dozens of keys.

{
    "a" : 1,
    "b" : 2,
    "children" : [
       {
         "x": "some long, complicated thing",
         "y": [5, 6],
         "huge_image": "..."
       },
       {
         "x": "some other complicated thing",
         "y": [1, 2, 3],
         "huge_image": "..."
       },
       ...          
    ]
}

It seems straightforward that I might want to import this as two collections, parents and children. (Indeed, I may have to if the children are extremely large documents, such as media.) Yet I cannot find any information on how to efficiently import existing nested data into MongoDB as multiple collections.

mongoimport takes only one collection argument. One can certainly import the data into one collection, then manually construct the second collection from the first and modify each entry of the first, but this seems both labor-intensive and inefficient for what surely must be a common problem.
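For reference, the manual approach can be sketched roughly as follows in Python. This is only a minimal sketch under several assumptions: the input is newline-delimited JSON (one parent object per line), the source data has no `_id` field so one is generated client-side, and `split_document`, `import_split`, and the `parent_id` field are hypothetical names introduced for illustration. The two collection arguments only need an `insert_many` method, which pymongo collections provide.

```python
import json
import uuid
from typing import Any, Dict, Iterable, List, Tuple


def split_document(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split one nested document into a parent and children that reference it."""
    # Copy every key except the embedded array into the parent document.
    parent = {k: v for k, v in doc.items() if k != "children"}
    # Assumption: the source has no _id, so generate one before inserting.
    parent.setdefault("_id", uuid.uuid4().hex)
    # Each child gets a (hypothetical) parent_id field pointing back at its parent.
    children = [dict(child, parent_id=parent["_id"]) for child in doc.get("children", [])]
    return parent, children


def import_split(lines: Iterable[str], parents: Any, children: Any,
                 batch_size: int = 1000) -> Tuple[int, int]:
    """Stream newline-delimited JSON into two collections in batches.

    Returns the number of parent and child documents written.
    """
    parent_batch: List[Dict[str, Any]] = []
    child_batch: List[Dict[str, Any]] = []
    n_parents = n_children = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parent, kids = split_document(json.loads(line))
        parent_batch.append(parent)
        n_parents += 1
        for kid in kids:
            child_batch.append(kid)
            n_children += 1
            if len(child_batch) >= batch_size:
                children.insert_many(child_batch)
                child_batch = []
        if len(parent_batch) >= batch_size:
            parents.insert_many(parent_batch)
            parent_batch = []
    # Flush whatever is left over after the last full batch.
    if parent_batch:
        parents.insert_many(parent_batch)
    if child_batch:
        children.insert_many(child_batch)
    return n_parents, n_children


# Possible usage with pymongo (database and file names are placeholders):
#
#   from pymongo import MongoClient
#   db = MongoClient()["mydb"]
#   with open("data.ndjson") as f:
#       import_split(f, db.parents, db.children)
```

Because the file is read line by line and written in batches, memory use is bounded by the batch size plus the largest single parent document. It still means writing and maintaining this code by hand, which is exactly what I was hoping to avoid.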

Is there something I'm missing here?

David Bruce Borenstein
  • Is children an object, an array of objects, or an array? – codeofnode Feb 22 '18 at 21:02
  • Thanks, that should have been an array of objects. Will fix. – David Bruce Borenstein Feb 22 '18 at 21:06
  • When you know that a child belongs to only one parent, why don't you store the children as an array of subdocuments and leverage the schemaless nature of MongoDB? Otherwise you might want to consider using another database. – codeofnode Feb 22 '18 at 21:08
  • It is common practice to represent subdocuments as referenced collections rather than embedded ones in MongoDB. In some cases, it's strictly necessary, as when the child documents would make the object too large for MongoDB. In other cases, it's because you sometimes want to access only the children, and need to write complex queries against them. See the first link, and also this: https://stackoverflow.com/questions/5373198/mongodb-relationships-embed-or-reference – David Bruce Borenstein Feb 22 '18 at 21:24
  • MongoDB does not provide references to other collections by default. It just stores a value, and other tools can resolve it at the cost of additional queries. And how about data consistency? Say you want to remove parent 1: you always need an additional query to remove its children to keep the data consistent. Overall, it's overhead on both sides. If you don't care about these, or have different requirements, then of course you are the only one who can make the call. – codeofnode Feb 22 '18 at 21:28
  • Hey David, I've come across the exact same issue. Did you find a reasonable solution to this? It does feel like this is a common problem and there should be something built-in to handle it. – hello-klol Nov 06 '18 at 06:19
  • Hi Katie, unfortunately, my only approach was to do it manually. The upside of that is that it caused me to develop different strategies for different scenarios. In many cases, gridfs is my weapon of choice, and it has nice interfaces in Python (my primary language) for handling this. – David Bruce Borenstein Nov 06 '18 at 14:52

0 Answers