
I have a requirement where I need to parse JSON objects from a text file and persist them to MongoDB.

SOME DETAILS -

  1. File size ~ 1-10 MB, number of JSON objects ~ 100k, so each individual JSON object is quite small.
  2. MongoDB cluster (sharded and replicated).
  3. Performance: time is at a premium.
  4. I cannot write any object to my MongoDB collection until I have parsed and validated the whole file.
  5. My app uses a J2EE stack (Spring 3.2).

So now I have a million Java objects that I need to hold somewhere before doing the bulk insert to MongoDB (the collection is sharded, so I have to pre-split the chunks for better performance, etc.).

My question is: how do I make this efficient? Some of the ways I thought of:

  1. Serialize the objects and store them in a file. (Problem: I/O time.)
  2. Write to a temporary collection on a standalone, non-sharded MongoDB instance and then bulk insert into the required collection (looks better than #1).

Can anyone share their experience with a similar problem? Do let me know if any other info is needed.

abipc
  • Why not just store it in RAM using lists or hashmaps? 10 MB is a rather small amount of RAM. – mvp Oct 23 '13 at 06:51
  • I could be parsing N files at a time: clients keep uploading files and I need to do the same thing on each of them. If I am processing 10 files, that's ~100 MB. But you may be right, I can give it a shot. – abipc Oct 23 '13 at 06:54
  • That still should not be a big deal. It's just a matter of choosing appropriate data structures on the Java side. – mvp Oct 23 '13 at 06:56
  • If you really have a million documents to insert times N users simultaneously, I'd guess that MongoDB will be the bottleneck as you slam it with those documents. – WiredPrairie Oct 23 '13 at 11:00

3 Answers


The proposed in-memory solution is not a good long-term solution, as you will probably have to redesign your app once you meet a customer whose data does not fit in memory.

In an RDBMS you would leverage transactions for this: use a streaming approach, i.e. load the data, verify it, and put it into the DB. If you hit an invalid object, just roll back the transaction and everything is fine. Whether that works depends on whether it's acceptable to lock the data for a potentially long time, as an RDBMS would typically lock the whole table and nobody would be able to read it.

Right now you are solving the problem with the weaker consistency of a NoSQL DB. The point is that you have to implement the rollback of your data yourself, in application code.

  1. You can use another DB, e.g. Redis, to store the temporary data. Since Redis has optional persistence, you can benefit from a large main memory and spill the data to disk only if it does not fit in memory.
  2. OR you can do the bulk insert right away and mark the documents (e.g. with a boolean flag) as not ready. Obviously, queries on production data must exclude everything that still carries the not-ready flag (a minimal sketch of this approach follows this list).
  3. If you use a temporary table/collection, it comes with constraints, as two concurrent runs of the same operation will affect each other.
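
For example, something along these lines with the current MongoDB Java driver (a minimal sketch only; the database, collection and field names such as `ready` and `batchId` are just placeholders for illustration):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.List;
import java.util.UUID;

public class FlaggedBulkInsert {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("appdb").getCollection("events");

            // Tag the whole upload with one batch id so it can be
            // activated (or cleaned up) as a unit.
            String batchId = UUID.randomUUID().toString();

            List<Document> docs = List.of(
                    new Document("payload", "a").append("ready", false).append("batchId", batchId),
                    new Document("payload", "b").append("ready", false).append("batchId", batchId));

            coll.insertMany(docs);

            // ... validate the rest of the file; whether it passed decides
            // if the batch is activated or rolled back.
            boolean fileIsValid = true;

            if (fileIsValid) {
                coll.updateMany(Filters.eq("batchId", batchId), Updates.set("ready", true));
            } else {
                coll.deleteMany(Filters.eq("batchId", batchId));
            }
            // Production queries must always filter on Filters.eq("ready", true).
        }
    }
}
```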

How would I design it?

I would probably use a separate Mongo instance for the not-ready data, to avoid that mutual interference, and once you know the documents can be moved to production, just move them into the correct collection.
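
Roughly like this (again only a sketch; the host, database and collection names, and the batch size of 1000, are assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class StagingToProduction {

    public static void main(String[] args) {
        try (MongoClient stagingClient = MongoClients.create("mongodb://staging-host:27017");
             MongoClient shardedClient = MongoClients.create("mongodb://mongos-host:27017")) {

            MongoCollection<Document> staging =
                    stagingClient.getDatabase("stagingdb").getCollection("pending");
            MongoCollection<Document> production =
                    shardedClient.getDatabase("appdb").getCollection("events");

            // Copy the validated batch over in chunks to keep memory bounded.
            List<Document> buffer = new ArrayList<>();
            for (Document doc : staging.find()) {
                buffer.add(doc);
                if (buffer.size() == 1000) {
                    production.insertMany(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                production.insertMany(buffer);
            }

            // Drop the staging data once it has been moved.
            staging.drop();
        }
    }
}
```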

Martin Podval
  • @Martin, this is as good as a correct answer. I understood what you are suggesting and am giving it a shot based on the points you mentioned above. – abipc Oct 23 '13 at 09:15
  • Copy the data to one location, persist it again in another, only to grab it again and finally save it to disk via MongoDB? That's a lot of moving parts, complexity, and data being copied. – WiredPrairie Oct 23 '13 at 11:03

Both of the ways you have mentioned are fine. I suggest you consider this approach too:

  1. As the file size is not too big, you can keep an array (or list) that will hold the objects.
  2. Once you validate an object, push it into the array.
  3. When all the objects have been validated, insert them into MongoDB in one go (see the sketch after this list).
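
A minimal sketch of those three steps, assuming one JSON object per line, Jackson for the validation step, and the current MongoDB Java driver (the file name, the `id` check, and the database/collection names are only placeholders):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ValidateThenInsert {

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<Document> validated = new ArrayList<>();

        // Assumes one JSON object per line in the uploaded file.
        for (String line : Files.readAllLines(Path.of("upload.json"))) {
            JsonNode node = mapper.readTree(line);    // throws on malformed JSON
            if (!node.has("id")) {                    // example business-level check
                throw new IllegalArgumentException("Invalid object: " + line);
            }
            validated.add(Document.parse(line));
        }

        // Only reached when every object in the file passed validation.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("appdb").getCollection("events");
            coll.insertMany(validated);
        }
    }
}
```
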
Jayram

I would go with RAM and a Map of direct ByteBuffers. In that case you're not limited by your heap size. And you can wrap your ByteBuffer with an InputStream for processing (see Wrapping a ByteBuffer with an InputStream). This approach might be tricky and require experimenting, i.e. choosing a proper buffer size to read from a ByteBuffer.
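
For illustration, a minimal sketch of the idea: a map of direct ByteBuffers plus a tiny InputStream view over a buffer. The map key and the payload are arbitrary examples, not part of any real API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffHeapBuffers {

    /** Minimal InputStream view over a ByteBuffer, as discussed in the linked question. */
    static class ByteBufferInputStream extends InputStream {
        private final ByteBuffer buf;

        ByteBufferInputStream(ByteBuffer buf) {
            this.buf = buf;
        }

        @Override
        public int read() {
            return buf.hasRemaining() ? buf.get() & 0xFF : -1;
        }

        @Override
        public int read(byte[] dst, int off, int len) {
            if (!buf.hasRemaining()) {
                return -1;
            }
            int n = Math.min(len, buf.remaining());
            buf.get(dst, off, n);
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Off-heap storage keyed by, e.g., the uploaded file name.
        Map<String, ByteBuffer> offHeap = new HashMap<>();

        byte[] payload = "{\"id\":1}".getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(payload.length);
        direct.put(payload);
        direct.flip();                       // prepare the buffer for reading
        offHeap.put("upload-1.json", direct);

        // Later: stream the buffered bytes back out for parsing/validation.
        try (InputStream in = new ByteBufferInputStream(offHeap.get("upload-1.json"))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```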

Andrey Chaschev