I have a huge dataset and I am using mongoose schemas. Each data element looks like this:
{
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2: "GAA…..GAATG"
}
Source: Reading a FASTA file
As you can see, the individual elements are simple and small, but they are huge in number! Together, they will exceed 200MB.
The problem is that I cannot save it to MongoDB, since it is too big (> 200MB).
I have found GridFS; nonetheless:
All the material I have found so far talks about image and video uploads;
It does not say how I could still use the mongoose schema capability;
The examples I have seen so far do not save the data into paths defined by the user, like we do with mongoose.
In the simplest scenario: how can I save a JSON file using GridFS, or any similar solution, as I do with small JSON files? What are the pros and cons of this approach compared to other approaches, if any? Do you consider my approach valid? I mean the one I mention here: using a tree of JSON files and populating later. It works!
As an example of saving a JSON file using mongoose:
Model.create([
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  },
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  }
]);
Here I have just saved a two-element JSON file. I cannot do that with a huge one: I need to break it into smaller pieces (chunks of, say, 1%) and create the tree just mentioned. At least, that was my solution.
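To make the splitting step concrete, here is a minimal sketch in plain JavaScript. The helper name splitIntoBatches and the placeholder records are my own assumptions for illustration, not part of mongoose or GridFS; each resulting batch is small enough to be passed to Model.create on its own.

```javascript
// Hypothetical helper (not a mongoose API): split an array of records into
// batches of at most `batchSize` elements, so each batch can be saved separately.
function splitIntoBatches(records, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let start = 0; start < records.length; start += batchSize) {
    batches.push(records.slice(start, start + batchSize));
  }
  return batches;
}

// Placeholder data standing in for the parsed FASTA entries.
const records = Array.from({ length: 1000 }, (_, i) => ({
  field1: ">read_" + i, // made-up header, not real data
  field2: "GAATG",      // made-up sequence
}));

// Pieces of 1% each: 1000 records become 100 batches of 10 records.
const batches = splitIntoBatches(records, Math.ceil(records.length / 100));
console.log(batches.length); // 100
```

Each batch could then be handed to Model.create one at a time, which keeps every individual write small.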
I am afraid I may be reinventing the wheel. I could save those files independently, and it works, but I need to keep them correlated because they belong to the same file, just as the smaller chunks of an image belong to the same image.
Current solution
This is my current solution, based on my own insights. Note that I mention it here just for curiosity: it does not use GridFS, so I am still open to suggestions using GridFS. It uses just JSON files and breaks the document into smaller ones in a level-like hierarchy. It is a tree, and I just want the leaves in the solution.
I have solved the problem using this diagram; nonetheless, I want, for learning purposes, to see if it is possible to do the same using GridFS.
Discussion
My first approach was to keep them as subdocuments: it failed! Then I tried to keep just their ids, but the ids correspond to 35% of the whole chunk, which is bigger than 16MB: it failed! Then I decided to create dummy documents just to hold the ids, and to store only the ids of the dummy documents: success!
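The layout that finally worked can be sketched in memory as follows. This is only an illustration under my own assumptions: buildTree, readAll, makeId and the field names leafIds and indexIds are hypothetical, and every object built here stands for one document that would be saved separately (for example with Model.create), so that no single document has to hold all the ids.

```javascript
// In-memory sketch of the three-level layout: root -> dummy (index) docs -> leaves.
// `makeId` is a hypothetical stand-in for a database-generated id.
let nextId = 0;
const makeId = () => "id_" + nextId++;

function buildTree(records, leafSize, idsPerIndexDoc) {
  // Leaves hold the actual data elements.
  const leaves = [];
  for (let i = 0; i < records.length; i += leafSize) {
    leaves.push({ _id: makeId(), records: records.slice(i, i + leafSize) });
  }
  // Dummy documents hold only the ids of a limited number of leaves.
  const indexDocs = [];
  for (let i = 0; i < leaves.length; i += idsPerIndexDoc) {
    const group = leaves.slice(i, i + idsPerIndexDoc);
    indexDocs.push({ _id: makeId(), leafIds: group.map((leaf) => leaf._id) });
  }
  // The root stores just the ids of the dummy documents.
  const root = { _id: makeId(), indexIds: indexDocs.map((doc) => doc._id) };
  return { root, indexDocs, leaves };
}

// Walk root -> dummy docs -> leaves and return all records in original order.
function readAll({ root, indexDocs, leaves }) {
  const indexById = new Map(indexDocs.map((doc) => [doc._id, doc]));
  const leafById = new Map(leaves.map((leaf) => [leaf._id, leaf]));
  return root.indexIds.flatMap((indexId) =>
    indexById.get(indexId).leafIds.flatMap((leafId) => leafById.get(leafId).records)
  );
}

// Placeholder data: 25 made-up records, 4 per leaf, 3 leaf ids per dummy doc.
const sampleRecords = Array.from({ length: 25 }, (_, i) => ({ field1: ">read_" + i }));
const tree = buildTree(sampleRecords, 4, 3);
console.log(tree.leaves.length, tree.indexDocs.length, tree.root.indexIds.length); // 7 3 3
```

With a real database the two Map lookups in readAll would become queries by id (or a populate), but the shape of the tree stays the same.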