
I have a requirement where I need to parse JSON objects from a text file and persist them to MongoDB.

SOME DETAILS -

  1. File size ~ 1-10 MB, number of JSON objects ~ 100k, so each individual JSON object is quite small.
  2. MongoDB cluster (sharded and replicated).
  3. Performance: time is at a premium.
  4. I cannot write any object to my MongoDB collection until I have parsed and validated the whole file.
  5. My app uses a J2EE stack (Spring 3.2).

So now I have a million Java objects that I need to hold somewhere before doing the bulk insert to MongoDB (the collection is sharded, so I have to pre-split the chunks for better performance, etc.).

My question is: how do I make this efficient? Some of the ways I thought of:

  1. Serialize the objects and store them in a file. (Problem: I/O time.)
  2. Write to a temporary collection on a standalone, non-sharded MongoDB instance and then bulk insert into the required collection (looks better than #1).

Can anyone share their experience with a similar problem? Do let me know if any other info is needed.

abipc
  • Why not just store it in RAM using lists or hashmaps? 10 MB is a rather small amount of RAM. – mvp Oct 23 '13 at 06:51
  • I could be parsing N files at a time: clients keep uploading files and I need to do the same thing on each of them. If I am processing 10 files, that's ~100 MB. But you may be right, I can give it a shot. – abipc Oct 23 '13 at 06:54
  • That still should not be a big deal. It's just a matter of choosing appropriate data structures on the Java side. – mvp Oct 23 '13 at 06:56
  • If you really have a million documents to insert times N users simultaneously, I'd guess that MongoDB will be the bottleneck as you slam it with those documents. – WiredPrairie Oct 23 '13 at 11:00

3 Answers


The proposed in-memory solution is not a good long-term solution, as you will probably have to redesign your app once you meet a customer whose data does not fit in memory.

In an RDBMS you would leverage transactions for this: use a streaming approach, i.e. load the data, verify it, and put it into the DB. If you hit an invalid object, just roll back the transaction and everything is fine. Whether that works depends on whether it's acceptable to lock the data for a potentially long time, as an RDBMS would typically lock the whole table and nobody would be able to read it.

Right now you are solving the problem with the weaker consistency of a NoSQL DB. The point is that you have to implement the rollback of your data yourself, in application code.

  1. You can use another DB, e.g. Redis, to store the temporary data. Since Redis has optional persistence, you can benefit from a large main memory and spill the data to disk only if it does not fit in memory.
  2. OR you can do the bulk insert right away and mark the documents (e.g. with a boolean flag) as not ready. Obviously, queries on production data must exclude everything that still carries the not-ready flag (a minimal sketch of this approach follows this list).
  3. If you use a temporary table/collection, it comes with constraints, as two concurrent runs of the same operation will affect each other.
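
For example, something along these lines with the current MongoDB Java driver (a minimal sketch only; the database, collection and field names such as `ready` and `batchId` are just placeholders for illustration):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.List;
import java.util.UUID;

public class FlaggedBulkInsert {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("appdb").getCollection("events");

            // Tag the whole upload with one batch id so it can be
            // activated (or cleaned up) as a unit.
            String batchId = UUID.randomUUID().toString();

            List<Document> docs = List.of(
                    new Document("payload", "a").append("ready", false).append("batchId", batchId),
                    new Document("payload", "b").append("ready", false).append("batchId", batchId));

            coll.insertMany(docs);

            // ... validate the rest of the file; whether it passed decides
            // if the batch is activated or rolled back.
            boolean fileIsValid = true;

            if (fileIsValid) {
                coll.updateMany(Filters.eq("batchId", batchId), Updates.set("ready", true));
            } else {
                coll.deleteMany(Filters.eq("batchId", batchId));
            }
            // Production queries must always filter on Filters.eq("ready", true).
        }
    }
}
```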

How would I design it?

I would probably use a separate Mongo instance for the not-ready data, to avoid that mutual interference, and once you know the documents can be moved to production, just move them into the correct collection.
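
Roughly like this (again only a sketch; the host, database and collection names, and the batch size of 1000, are assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class StagingToProduction {

    public static void main(String[] args) {
        try (MongoClient stagingClient = MongoClients.create("mongodb://staging-host:27017");
             MongoClient shardedClient = MongoClients.create("mongodb://mongos-host:27017")) {

            MongoCollection<Document> staging =
                    stagingClient.getDatabase("stagingdb").getCollection("pending");
            MongoCollection<Document> production =
                    shardedClient.getDatabase("appdb").getCollection("events");

            // Copy the validated batch over in chunks to keep memory bounded.
            List<Document> buffer = new ArrayList<>();
            for (Document doc : staging.find()) {
                buffer.add(doc);
                if (buffer.size() == 1000) {
                    production.insertMany(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                production.insertMany(buffer);
            }

            // Drop the staging data once it has been moved.
            staging.drop();
        }
    }
}
```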

Martin Podval
  • @Martin, this is as good as a correct answer. I understood what you are suggesting and am giving it a shot based on the points you mentioned above. – abipc Oct 23 '13 at 09:15
  • Copy the data to one location, persist it again in another, only to grab it again and finally save it to disk via MongoDB? That's a lot of moving parts, complexity, and data being copied. – WiredPrairie Oct 23 '13 at 11:03

Both of the ways you have mentioned are fine. I suggest you consider this approach too:

  1. As the file size is not too big, you can keep an array (or list) that will hold the objects.
  2. Once you validate an object, push it into the array.
  3. When all the objects have been validated, insert them into MongoDB in one go (see the sketch after this list).
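
A minimal sketch of those three steps, assuming one JSON object per line, Jackson for the validation step, and the current MongoDB Java driver (the file name, the `id` check, and the database/collection names are only placeholders):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ValidateThenInsert {

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<Document> validated = new ArrayList<>();

        // Assumes one JSON object per line in the uploaded file.
        for (String line : Files.readAllLines(Path.of("upload.json"))) {
            JsonNode node = mapper.readTree(line);    // throws on malformed JSON
            if (!node.has("id")) {                    // example business-level check
                throw new IllegalArgumentException("Invalid object: " + line);
            }
            validated.add(Document.parse(line));
        }

        // Only reached when every object in the file passed validation.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("appdb").getCollection("events");
            coll.insertMany(validated);
        }
    }
}
```
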
Jayram

I would go with RAM and a Map of direct ByteBuffers. In that case you're not limited by your heap size. And you can wrap your ByteBuffer with an InputStream for processing (see Wrapping a ByteBuffer with an InputStream). This approach might be tricky and require experimenting, i.e. choosing a proper buffer size to read from a ByteBuffer.
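
For illustration, a minimal sketch of the idea: a map of direct ByteBuffers plus a tiny InputStream view over a buffer. The map key and the payload are arbitrary examples, not part of any real API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffHeapBuffers {

    /** Minimal InputStream view over a ByteBuffer, as discussed in the linked question. */
    static class ByteBufferInputStream extends InputStream {
        private final ByteBuffer buf;

        ByteBufferInputStream(ByteBuffer buf) {
            this.buf = buf;
        }

        @Override
        public int read() {
            return buf.hasRemaining() ? buf.get() & 0xFF : -1;
        }

        @Override
        public int read(byte[] dst, int off, int len) {
            if (!buf.hasRemaining()) {
                return -1;
            }
            int n = Math.min(len, buf.remaining());
            buf.get(dst, off, n);
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Off-heap storage keyed by, e.g., the uploaded file name.
        Map<String, ByteBuffer> offHeap = new HashMap<>();

        byte[] payload = "{\"id\":1}".getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(payload.length);
        direct.put(payload);
        direct.flip();                       // prepare the buffer for reading
        offHeap.put("upload-1.json", direct);

        // Later: stream the buffered bytes back out for parsing/validation.
        try (InputStream in = new ByteBufferInputStream(offHeap.get("upload-1.json"))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```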

Andrey Chaschev