13

I have more than 10 million JSON documents of the form:

["key": "val2", "key1" : "val", "{\"key\":\"val", \"key2\":\"val2"}"]

in one file.

Importing using the Java driver API took around 3 hours with the following function (inserting one BSON document at a time):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoException;
import com.mongodb.util.JSON;

public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open the file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file does not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));

    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert each line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert the BSON document into the database
            try {
                newColl.insert(bson);
            } catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Is there a faster way? Perhaps MongoDB settings influence the insertion speed? For example, supplying the "_id" key myself (which serves as the index), so that MongoDB does not have to generate an artificial key, and thus an index, for each document, or disabling index creation entirely during insertion. Thanks.

rok
  • Have you attempted to parse more than one line at a time? This could decrease the total overhead spent initializing your JSON parser. Additionally, you may look into alternative BSON/JSON parsers. Jackson (http://jackson.codehaus.org/) is known for being extremely fast and I believe now has native BSON support. It is also customizable so there may be some features of the parser you could remove/optimize. – drobert Oct 28 '13 at 15:34
  • One other thing: you may want to separate the parsing of json from the saving of the parsed result. That is, this seems like a straightforward producer/consumer problem where one thread could be reading/parsing JSON and adding to a queue while the other thread pulls from the queue and *batch* inserts into the db. I would imagine the 'one insert at a time' approach is your slowest part, but it's difficult to know without profiling. – drobert Oct 28 '13 at 15:35
  • Maybe you should try putting the actual parsing in a different thread. Perhaps you could use an `Executors.newFixedThreadPool(n)` where `n` is the number of threads that can be running at any given time. – Josh M Oct 28 '13 at 15:36
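
To illustrate the producer/consumer split suggested in the comments, here is a minimal sketch, assuming the legacy com.mongodb driver; the thread count (4), the batch size (1000), and names such as ParallelImporter are arbitrary choices for the example, not anything from the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.util.JSON;

public class ParallelImporter {
    public static void importFile(String path, final DBCollection coll)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        BufferedReader br = new BufferedReader(new FileReader(path));
        List<String> batch = new ArrayList<String>(1000);
        String line;
        while ((line = br.readLine()) != null) {
            batch.add(line);
            if (batch.size() == 1000) {
                submit(pool, coll, batch);
                batch = new ArrayList<String>(1000);
            }
        }
        submit(pool, coll, batch);
        br.close();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void submit(ExecutorService pool, final DBCollection coll, final List<String> batch) {
        if (batch.isEmpty()) {
            return;
        }
        pool.submit(new Runnable() {
            public void run() {
                List<DBObject> docs = new ArrayList<DBObject>(batch.size());
                for (String json : batch) {
                    // parsing happens off the reader thread; if the parser turns out not to be
                    // thread-safe, give each worker its own parser instance instead
                    docs.add((DBObject) JSON.parse(json));
                }
                coll.insert(docs);   // one bulk insert per batch instead of one insert per line
            }
        });
    }
}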

7 Answers

10

I'm sorry, but you're all picking at minor performance issues instead of the core one. Separating the file-reading logic from the inserting is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using Mongo's bulk inserts is a big gain, but still no dice.

The whole performance bottleneck is the `DBObject bson = (DBObject) JSON.parse(line)` call. Or in other words, the problem with the Java driver is that it needs a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode + decode) via JSON-simple, or especially via JSON-smart, is 100 times faster than the JSON.parse() command.
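
As a rough illustration of swapping the parser, here is a sketch assuming the json-smart library (net.minidev.json) is on the classpath; the class and method names are made up for the example:

import java.util.Map;

import net.minidev.json.JSONValue;

import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

public class FastLineParser {
    // Parse one line of JSON with json-smart and wrap the resulting Map for the Java driver,
    // bypassing com.mongodb.util.JSON.parse() entirely.
    @SuppressWarnings("unchecked")
    public static DBObject parseLine(String line) {
        Map<String, Object> map = (Map<String, Object>) JSONValue.parse(line);
        return new BasicDBObject(map);
    }
}

The insert loop then stays the same; only the parsing step changes.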

I know Stack Overflow is telling me right above this box that I should be answering the question, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance, and then this simple example code fails so miserably.

  • Totally agree, JSON.parse is pretty terrible and doesn't appear to be thread-safe, based on some experiments trying to convert in parallel. – pjp Nov 08 '15 at 16:40
5

I've imported a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 10M in 3 hours, I think this is considerably faster.

Also, from my experience, writing your own multi-threaded parser speeds things up drastically. The procedure is simple (a rough sketch of the marker alignment follows the list):

  1. Open the file as BINARY (not TEXT!)
  2. Set markers (offsets) evenly across the file. The count of markers depends on the number of threads you want.
  3. Search for '\n' near the markers, calibrate the markers so they are aligned to lines.
  4. Parse each chunk with a thread.
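
A rough sketch of steps 2 and 3, assuming lines are much shorter than a chunk so markers never cross; the class and method names (ChunkPlanner, planChunks) are placeholders, not part of the answer:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkPlanner {
    // Returns threads + 1 byte offsets [0, ..., fileSize]; thread i should process
    // the bytes in [offsets[i], offsets[i + 1]).
    public static long[] planChunks(String path, int threads) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long size = raf.length();
            long[] offsets = new long[threads + 1];
            offsets[threads] = size;
            for (int i = 1; i < threads; i++) {
                long marker = size * i / threads;           // evenly spaced marker
                raf.seek(marker);
                while (raf.getFilePointer() < size && raf.read() != '\n') {
                    // advance until the marker is calibrated to the end of a line
                }
                offsets[i] = raf.getFilePointer();          // start of the next line
            }
            return offsets;
        }
    }
}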

A reminder:

When you want performance, don't use a stream reader or any built-in line-based read methods. They are slow. Just use a binary buffer and search for '\n' to identify a line, and (most preferably) do in-place parsing in the buffer without creating a String. Otherwise the garbage collector won't be happy with this.
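
A minimal sketch of that buffer-scanning idea, assuming '\n'-terminated lines and an 8 MB read buffer; handleLine is a placeholder for whatever parsing you plug in:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class LineScanner {
    // Scans bytes [start, end) of the file and calls handleLine() once per '\n'-terminated line.
    public static void scan(String path, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            byte[] buf = new byte[8 * 1024 * 1024];
            int filled = 0;                               // bytes currently held in buf
            long remaining = end - start;
            while (remaining > 0) {
                int read = raf.read(buf, filled, (int) Math.min(buf.length - filled, remaining));
                if (read <= 0) {
                    break;
                }
                remaining -= read;
                filled += read;
                int lineStart = 0;
                for (int i = 0; i < filled; i++) {
                    if (buf[i] == '\n') {
                        handleLine(buf, lineStart, i - lineStart);
                        lineStart = i + 1;
                    }
                }
                // keep the partial trailing line for the next read
                System.arraycopy(buf, lineStart, buf, 0, filled - lineStart);
                filled -= lineStart;
            }
        }
    }

    private static void handleLine(byte[] buf, int offset, int length) {
        // placeholder: ideally parse in place; decoding to a String is shown only for illustration
        String line = new String(buf, offset, length, StandardCharsets.UTF_8);
    }
}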

Yadli
4

You can parse the entire file at once and then insert the whole JSON into a Mongo document. Avoid multiple loops; you need to separate the logic as follows:

1) Parse the file and retrieve the JSON object.

2) Once the parsing is over, save the JSON object in the Mongo document.

Jhanvi
3

I've got a slightly faster way (I'm also inserting millions at the moment): insert lists of documents instead of single documents with

insert(List<DBObject> list)

http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)

That said, it's not that much faster. I'm about to experiment with setting WriteConcerns other than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed it up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
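
For example, a minimal sketch of batching with a relaxed write concern (legacy com.mongodb driver; the batch size of 1000 and the class name BatchInserter are assumptions, not from the answer):

import java.util.ArrayList;
import java.util.List;

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.WriteConcern;

public class BatchInserter {
    private static final int BATCH_SIZE = 1000;
    private final List<DBObject> buffer = new ArrayList<DBObject>(BATCH_SIZE);
    private final DBCollection collection;

    public BatchInserter(DBCollection collection) {
        this.collection = collection;
    }

    public void add(DBObject doc) {
        buffer.add(doc);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // UNACKNOWLEDGED skips waiting for a server response: faster, but errors go unreported
        collection.insert(buffer, WriteConcern.UNACKNOWLEDGED);
        buffer.clear();
    }
}

Call add() for each parsed document and flush() once after the last line.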

Another way to improve performance is to create indexes after bulk inserting. However, this is rarely an option except for one-off jobs.

Apologies if this is slightly woolly sounding; I'm still testing things myself. Good question.

tom
2

You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.
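
A hedged sketch of what that looks like with the Java driver (2.12 or later); "fieldName" stands in for whatever secondary indexes the collection really has:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class IndexRebuild {
    public static void dropBeforeImport(DBCollection collection) {
        collection.dropIndexes();   // drops all secondary indexes; the _id index remains
    }

    public static void rebuildAfterImport(DBCollection collection) {
        // recreate each index the application needs once the bulk load has finished
        collection.createIndex(new BasicDBObject("fieldName", 1));
    }
}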

evanchooly
2

Use bulk operations for inserts/upserts. Since MongoDB 2.6 you can do bulk updates/upserts. The example below does a bulk upsert using the C# driver.

MongoCollection<FooDoc> collection = database.GetCollection<FooDoc>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
    var update = new UpdateDocument { { fooDoc.ToBsonDocument() } };
    bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();
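
For the Java driver (2.12 or later), a roughly equivalent bulk upsert might look like the sketch below; it assumes each document already carries an "_id" field, and the class and variable names are placeholders:

import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class BulkUpserter {
    public static BulkWriteResult upsertAll(DBCollection collection, List<DBObject> docs) {
        BulkWriteOperation bulk = collection.initializeUnorderedBulkOperation();
        for (DBObject doc : docs) {
            // match on _id and replace (or insert) the whole document
            bulk.find(new BasicDBObject("_id", doc.get("_id"))).upsert().replaceOne(doc);
        }
        return bulk.execute();   // one round trip per batch of queued operations
    }
}
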
PUG
0

You can use a bulk insertion.

You can read the documentation on the MongoDB website, and you can also check this Java example on Stack Overflow.

Community
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Leistungsabfall Oct 02 '14 at 20:34