
I have a huge dataset. I am using mongoose schemas, and each data element looks like this:

    {
      field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
      field2: "GAA…..GAATG"
    }

Source: reading a FASTA file

As you can see, the individual elements are simple and small, but they are huge in number: together they exceed 200 MB.

The problem is: I cannot save it to Mongo since it is too big (> 200 MB).

I have found GridFS; nonetheless:

  • All the materials I have found so far talk about image and video uploads;

  • They do not say how I could still use the mongoose schema capabilities;

  • The examples I have seen so far do not save the data into paths defined by the user, as we do with mongoose.

In the simplest scenario: how can I save a JSON file using GridFS, or any similar solution, the same way I do with small JSON files? What are the pros and cons of this approach compared to other approaches, if any? Do you consider my approach valid? I mean the one I have mentioned here, using a tree of JSON files and populating later; it works!

As an example of saving a JSON file using mongoose:

Model.create([
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  },
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  }
]);

Here I have just saved a two-element JSON file; I cannot do that with a huge one. I need to break it into smaller pieces (chunks of, say, 1%) and create the tree just mentioned; at least, that was my solution. A rough sketch of that batched insertion is shown below.
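
To illustrate what I mean by breaking it into pieces, here is a minimal sketch (not my actual code; the `FastaElement` model name is just an assumption for the example): the big array is split into batches, each batch is inserted with `insertMany`, and the ids are kept so they can be linked to a parent later.

// Minimal sketch: insert a large array of FASTA elements in batches,
// collecting the inserted ids so they can be linked to a parent document.
// Assumes a mongoose model `FastaElement` with fields field1/field2.
async function saveInBatches(elements, batchSize = 1000) {
  const ids = [];
  for (let i = 0; i < elements.length; i += batchSize) {
    const batch = elements.slice(i, i + batchSize);
    const docs = await FastaElement.insertMany(batch);
    ids.push(...docs.map((doc) => doc._id));
  }
  return ids;
}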

I could save those documents independently, and it works, but I need to keep them correlated, because they belong to the same file, just as the smaller chunks of an image belong to the same image. I am afraid I may be reinventing the wheel.

Current solution

This is my current solution, based on my own insights! Note that I am mentioning it here just out of curiosity; it does not use GridFS, so I am still open to suggestions that use GridFS. It uses just JSON documents, breaking the big document into smaller ones in a level-like hierarchy. It is a tree, and I just want the leaves in the solution.

(Diagram: the original document is broken into a tree of smaller JSON documents; only the leaves hold the data.)

I have solved the problem using the approach in this diagram; nonetheless, for learning purposes, I want to see whether it is possible to do the same using GridFS.

Discussion

My first approach was to keep them as subdocuments: it failed! Then I tried to keep just their ids, but the ids correspond to 35% of the whole chunk, and that is bigger than 16 MB: failed! Then I decided to create dummy documents just to keep the ids, and to store only the ids of those dummy documents: success! A sketch of that idea is shown below.
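
For clarity, here is a rough sketch of that dummy-document idea (the schema names here are hypothetical, just to illustrate the tree): the element ids are grouped into intermediate documents, and the root stores only the ids of those intermediate documents.

// Hypothetical schemas illustrating the id tree described above.
const mongoose = require("mongoose");

// Leaves: the actual FASTA elements.
const FastaElement = mongoose.model("FastaElement", new mongoose.Schema({
  field1: String,
  field2: String,
}));

// Intermediate ("dummy") documents: each one holds a slice of element ids.
const FastaChunk = mongoose.model("FastaChunk", new mongoose.Schema({
  elementIds: [{ type: mongoose.Schema.Types.ObjectId, ref: "FastaElement" }],
}));

// Root: stores only the ids of the intermediate documents,
// which keeps it far below the 16 MB document limit.
const Fasta = mongoose.model("Fasta", new mongoose.Schema({
  name: String,
  chunkIds: [{ type: mongoose.Schema.Types.ObjectId, ref: "FastaChunk" }],
}));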

  • No one :) I am solving it using my own strategy, but I am still open to learning! – Jorge Guerra Pires Apr 01 '20 at 22:07
  • let me know if my answer is helpful to you – Codebling Apr 05 '20 at 19:31
  • Hey there, thanks for your answer. Your answer made me realize that things are more complicated than I had thought. I have solved the problem in my own way, and it is taking seconds to retrieve, but 9 minutes to save; for now I am happy with this solution. The answer you sent me to is 5 years old. I am glad you replied, and it gave me insights, but I was waiting for something more concrete. Maybe I was wanting too much; if no one else appears, I will accept your answer and give you the bounty, no need to worry; it may compensate you. – Jorge Guerra Pires Apr 05 '20 at 20:16
  • Yes, I nearly made a comment about the fact that it is 5 years old... Nonetheless, I do not think that GridFS has changed in that regard. The fact that the retrieval/storage time is proportional to the size of the document has to do with the design of GridFS, the way it splits data into chunks. So 5 years later nothing has changed on that front, as far as I know. – Codebling Apr 05 '20 at 20:30
  • Hopefully you will get other answers that are more insightful! – Codebling Apr 05 '20 at 20:30
  • I have updated the question. One issue you have not addressed in your answer, and it is the key of the question, is how to **save a user-defined JSON file**: mine comes from a set of smaller JSON files. I saw several examples on the net for saving videos and images, but what about a user-defined JSON, while still making use of mongoose schemas? – Jorge Guerra Pires Apr 05 '20 at 22:11
  • You save a JSON file in GridFS the way you would save any other file. But it also behaves just like any other file...you **cannot** query it. The data is opaque. Sorry for the confusion, I thought you were only trying to save the FASTA data in GridFS. – Codebling Apr 05 '20 at 22:20
  • What do you mean by *opaque*? – Jorge Guerra Pires Apr 05 '20 at 22:21
  • "You save a JSON file in GridFS the way you would save any other file." can you transfom it into an answer? – Jorge Guerra Pires Apr 05 '20 at 22:23
  • By "opaque", I mean that Mongo doesn't know or care what data is in the file, it only knows that a file is stored there. I can give an example of saving a file, yes. – Codebling Apr 05 '20 at 22:24
  • You mean I cannot use mongoose schemas? That is, the data is just saved, but the structure is lost. – Jorge Guerra Pires Apr 05 '20 at 22:25
  • Neither can I use `find()`...`save()`; I mean, all the built-in functions from mongoose. – Jorge Guerra Pires Apr 05 '20 at 22:26
  • Correct. GridFS is for storing files, not documents. – Codebling Apr 05 '20 at 22:27
  • "GridFS is for storing files, not documents." does that mean I lose all the documents related advantages of Mongo? – Jorge Guerra Pires Apr 05 '20 at 22:28
  • You can store a file in Mongo using GridFS. If you use GridFS, regardless of what type of data is in the file, you will not be able to query it. You cannot use `find()`, `save()`, or any other Collection methods to access data in a file saved with GridFS. You **can** still use `find()` and other Collection methods to query/access the GridFS-stored file's *metadata*, which contains the file size, the file name, the number of chunks, and any other data you wish to save with the file. You can still use Collection methods on any regular documents (which are not GridFS files). – Codebling Apr 05 '20 at 22:35
  • Thanks, now things seem clearer. I will try to test your insights soon; the last time I tried, I did not succeed. The best way to learn is coding! Thanks. – Jorge Guerra Pires Apr 05 '20 at 22:38
  • "It's very very likely not worth storing the data in Mongo using GridFS." isn't it GridFS an offiicial release from mongo? – Jorge Guerra Pires Apr 05 '20 at 22:39
  • Yes, it is an official release from Mongo. Since you cannot query it, I'm not sure what the advantage would be. There are better tools. – Codebling Apr 06 '20 at 01:31

2 Answers


It's very very likely not worth storing the data in Mongo using GridFS.

Binary data never really belongs in a database, but if the data is small, the benefits of putting it in the database (ability to query) outweigh the drawbacks (server load, slow).

In this case, it looks like you'd like to store document data (JSON) in GridFS. You may do this, and store it the way you would store any other binary data. The data, however, will be opaque. You cannot query JSON data stored in a GridFS document, only the file metadata.
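
As an illustration of that point, here is a minimal sketch (assuming a file named `fasta-data.json` was already uploaded, as in the example at the end of this answer): you can query the file's metadata through the bucket, but not the JSON inside the file.

// Sketch: query GridFS file metadata (not the file contents).
// Assumes `db` is a connected Db instance and `mongodb` is required.
const bucket = new mongodb.GridFSBucket(db);

bucket.find({ filename: 'fasta-data.json' }).toArray()
  .then((files) => {
    // Each entry exposes metadata such as length, chunkSize and uploadDate;
    // the JSON stored inside the file cannot be queried this way.
    console.log(files);
  });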

Querying big data

As you mentioned that you wanted to query the data, you should check the format of your data. If your data is in the format listed in the example, then it seems like there is no need for complicated queries, only string matching. So there are several options.

Case 1: Large Data, Few Points

If you do not have many data sets (pairs of field1 and field2), but the data for each one is large (field2 contains many bytes), store the data elsewhere and keep only a reference to it. A simple solution would be to store the data (formerly field2) in a text file on Amazon S3 and then store the link, e.g.

{
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2link: "https://my-bucket.s3.us-west-2.amazonaws.com/puppy.png"
}
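
A rough sketch of that workflow using the AWS SDK v3 (the bucket name, region, key and `model` argument are placeholders, not something from the question):

// Sketch: upload the large sequence data to S3 and keep only the link in Mongo.
// Bucket name, region, key and the `model` argument are hypothetical.
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

async function storeSequence(model, field1, sequence) {
  const s3 = new S3Client({ region: 'us-west-2' });
  const key = `sequences/${Date.now()}.txt`;

  // Store the heavy part (formerly field2) as a plain text object on S3.
  await s3.send(new PutObjectCommand({
    Bucket: 'my-bucket',
    Key: key,
    Body: sequence,
  }));

  // Keep only the small reference document in MongoDB.
  return model.create({
    field1,
    field2link: `https://my-bucket.s3.us-west-2.amazonaws.com/${key}`,
  });
}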

Case 2: Small Data, Many Points

If each data set is small (less than 16 MB) but there are many data sets, store your data in MongoDB (without GridFS).

Specifics

In your case, the data is quite large and storing it using GridFS is inadvisable.

This answer provides a benchmark towards the bottom. The benchmark seems to indicate that the retrieval time is more or less directly proportional to the file size. With the same setup, it would take 80 seconds to retrieve a document from the database.

Possible optimisations

The default chunk size in GridFS is 255 KiB. You may be able to reduce large file access times by increasing the chunk size to the maximum (16 MB). If the chunk size is the only bottleneck, then using the 16 MB chunk size would reduce the retrieval time from 80 seconds to 1.3 seconds (80 / (16MB/255KiB) = 1.3). You can do this when initialising the GridFS bucket.

new GridFSBucket(db, {chunkSizeBytes: 16000000})

A better strategy would be to store only the file name in Mongo and retrieve the file from the filesystem instead.
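
A minimal sketch of that idea (the `FastaFile` model and the path field are placeholders):

// Sketch: keep only a file reference in Mongo and read the data from disk.
const fs = require('fs');
const mongoose = require('mongoose');

const FastaFile = mongoose.model('FastaFile', new mongoose.Schema({
  name: String,
  path: String, // where the FASTA/JSON file lives on the filesystem
}));

async function loadFasta(id) {
  const doc = await FastaFile.findById(id);
  return fs.promises.readFile(doc.path, 'utf8');
}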

Other drawbacks

Another possible drawback of storing the binary data in Mongo comes from this site: "If the binary data is large, then loading the binary data into memory may cause frequently accessed text (structured data) documents to be pushed out of memory, or more generally, the working set might not fit into RAM. This can negatively impact the performance of the database." [1]

Example

Saving a file in GridFS, adapted from the Mongo GridFS tutorial

// Requires the official MongoDB Node.js driver and the fs module.
const mongodb = require('mongodb');
const fs = require('fs');

const uri = 'mongodb://localhost:27017/test';

// Note: with the 2.x driver the connect callback receives a Db instance;
// newer drivers pass a MongoClient, from which you would call client.db().
mongodb.MongoClient.connect(uri, (error, db) => {
  if (error) throw error;
  const bucket = new mongodb.GridFSBucket(db);

  // Stream the JSON file from disk into GridFS under the given filename.
  fs.createReadStream('./fasta-data.json')
    .pipe(bucket.openUploadStream('fasta-data.json'))
    .on('finish', () => console.log('done!'))
  ;
});
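
For completeness, a matching sketch for streaming the file back out of GridFS (same assumptions as above):

// Sketch: stream the stored file back out of GridFS to the local filesystem.
mongodb.MongoClient.connect(uri, (error, db) => {
  if (error) throw error;
  const bucket = new mongodb.GridFSBucket(db);

  bucket.openDownloadStreamByName('fasta-data.json')
    .pipe(fs.createWriteStream('./fasta-data-copy.json'))
    .on('finish', () => console.log('downloaded!'))
  ;
});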
Codebling
  • "A better strategy would be to store the only file name in Mongo and retrieve the file from the filesystem instead." you mean saving the file normally, like any file we save daily, and retrieve it using the link? – Jorge Guerra Pires Apr 05 '20 at 22:42
  • Save it on something that provides redundancy and is accessible from the same places that your server is. Amazon S3 is a good option. I've updated the answer – Codebling Apr 06 '20 at 01:48
  • Hey there, I have seen you have updated the answer. My situation seems to be close to **case 2**; the documents themselves are small, just 2-4 fields; I can save them independently, no problem, it takes about 9 minutes. But… I need to somehow connect them. My first solution was to keep their individual ids, but that was too big as well! Then I decided to do it like a tree: keep the id of a dummy document that keeps their ids. "If each data set is small (less than 16 MB) but there are many data sets, store your data in MongoDB (without GridFS)." Can you explain this better? – Jorge Guerra Pires Apr 06 '20 at 15:38
  • @JorgePires small means **bytes**, not number of fields. Your example data has only 2 fields and does not indicate size in bytes of each field. Please indicate how many data points/sets you have (size of array, as in the example posted in the question) and the minimum and maximum size in bytes of total data per data point/set – Codebling Apr 06 '20 at 16:21
  • I do not have this information, it is not fixed. I just know the fields, the size of each field I have no idea. I could make an estimation, but I believe it is meaningless since I cannot control that in a real scenario. – Jorge Guerra Pires Apr 06 '20 at 16:24
  • "If each data set is small (less than 16 MB) but there are many data sets, store your data in MongoDB (without GridFS)." I believe that is what I did! – Jorge Guerra Pires Apr 06 '20 at 16:25

I have found a better way to solve this problem than the one I had implemented, the one in the question description: I just need to use virtuals!

At first I thought that using forEach to add an extra field to each Fasta element would be slow; it is not, it is pretty fast!

I can do something like this for each element of the Fasta file:

{
  Parentid: { type: mongoose.Schema.Types.ObjectId, ref: "Fasta" }, // add this new field with its parent id
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2: "GAA…..GAATG"
}
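
To make that step concrete, here is a rough sketch of how the parent id could be attached to every element before saving (the model names follow the snippets in this answer and are otherwise assumptions):

// Sketch: create the parent Fasta document, tag every element with its id,
// then insert all elements. Model names follow the snippets in this answer.
async function saveFasta(name, elements) {
  const parent = await Fasta.create({ name });

  // The forEach step mentioned above: attach the parent id to each element.
  elements.forEach((element) => {
    element.Parentid = parent._id;
  });

  await FastaElement.insertMany(elements);
  return parent;
}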

Then something like this:

FastaSchema.virtual("healthy", {
  ref: "FastaElement",
  localField: "_id",
  foreignField: "parent",
  justOne: false,
});
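
Two details worth noting, sketched below (these are assumptions about the full schema setup, which is not shown here): `foreignField` must match the field name used in the element schema, and the populated virtual is only included in the `res.json` output if the schema enables virtuals for serialization.

// Sketch: assumed schema details consistent with the snippets above.
// Assumes mongoose is already required and FastaSchema is defined as above.
const FastaElementSchema = new mongoose.Schema({
  Parentid: { type: mongoose.Schema.Types.ObjectId, ref: "Fasta" }, // must match foreignField
  field1: String,
  field2: String,
});
mongoose.model("FastaElement", FastaElementSchema);

// Without this, the populated "healthy" virtual is dropped when the
// document is serialized by res.json().
FastaSchema.set("toJSON", { virtuals: true });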

Finally, populate:

  Fasta.find({ _id: ObjectId("5e93b9b504e75e5310a43f46") })
    .populate("healthy")
    .exec(function (error, result) {          
      res.json(result);
    });

And the magic is done: no problem with subdocument overload! Populate applied to a virtual is pretty fast and causes no overload. I have not done it, but it would be interesting to compare this with conventional populate; however, this approach has the advantage of not needing to create a hidden document to store the ids.

I am speechless at how simple this solution is; it just came up while I was answering another question here!

Thanks to mongoose!