The code below performs batch inserts into MongoDB. It takes a long time to run: about an hour to insert 20 million documents.
The time-consuming portion is the scanner.hasNextLine()/nextLine()/insert loop, which at times slows to about 20 seconds per iteration, most noticeably toward the middle of the job. (Answers on this forum indicate that Mongo inserts, batched or not, can be expensive because the driver converts JSON into the binary BSON format.)
I want to speed this up by processing the job in parallel on several cores. Can I do this with Fork/Join? I ask because I cannot see how to apply a divide-and-conquer strategy to this code, with its while loop over an input stream.
Another possibility is a ThreadPoolExecutor. Would an executor be the better choice, and would it distribute the work across several cores?
The code:
Scanner lineScan = new Scanner(inputStream, encoding);
List<DBObject> batch = new ArrayList<DBObject>();
while (lineScan.hasNextLine()) {
    // add to the list of DBObjects to be inserted as a batch
    batch.add((DBObject) JSON.parse(lineScan.nextLine()));
    if (batch.size() >= BATCH_SIZE) { // do the batch insert at the threshold
        collection.insert(batch);
        batch.clear();
    }
}
Similar code using a ThreadPoolExecutor (see Java Iterator Concurrency and Java: Concurrent reads on an InputStream):
ExecutorService executor = Executors.newCachedThreadPool();
Iterator<Long> i = getUserIDs();
while (i.hasNext()) {
    final Long l = i.next();
    Runnable task = new Runnable() {
        public void run() {
            someObject.doSomething(l);
            anotherObject.doSomething(l);
        }
    };
    executor.submit(task);
}
executor.shutdown();
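To make the question concrete, here is a minimal sketch of how I imagine combining the two: the main thread reads lines and accumulates batches, and each full batch is handed to a fixed thread pool. The insertBatch method is a hypothetical stand-in for the real Mongo driver insert, so the example is self-contained and runnable; the class name, batch size, and pool size are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelBatchInsert {
    static final int BATCH_SIZE = 3;
    static final AtomicInteger inserted = new AtomicInteger();

    // Hypothetical stand-in for collection.insert(batch); in the real job
    // this would call the Mongo driver with a List<DBObject>.
    static void insertBatch(List<String> batch) {
        inserted.addAndGet(batch.size());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        // Stand-in input: seven lines, in place of the real inputStream.
        Scanner lineScan = new Scanner("a\nb\nc\nd\ne\nf\ng");
        List<String> batch = new ArrayList<String>();
        while (lineScan.hasNextLine()) {
            batch.add(lineScan.nextLine());
            if (batch.size() >= BATCH_SIZE) {
                // Hand the full batch to the pool; start a fresh list so the
                // reader never mutates a batch a worker is inserting.
                final List<String> toInsert = batch;
                executor.submit(new Runnable() {
                    public void run() {
                        insertBatch(toInsert);
                    }
                });
                batch = new ArrayList<String>();
            }
        }
        if (!batch.isEmpty()) {
            insertBatch(batch); // flush the final partial batch
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(inserted.get()); // 7
    }
}
```

My uncertainty is whether handing off whole batches this way actually overlaps the BSON conversion and insert work across cores, or whether the single reader thread (or the database itself) remains the bottleneck.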
Any perspectives on which technique would best speed up this loop and its inserts would be greatly appreciated. Many thanks in advance!