I have 40 million data in mongoDB. I am reading that data in parallel from collection, processing it and dumping into another collection.
Sample code for job initialization.
ExecutorService executor = Executors.newFixedThreadPool(10);
int count = total_number_of_records in reading collection
int pageSize = 5000;
int counter = (int) ((count%pageSize==0)?(count/pageSize):(count/pageSize+1));
for (int i = 1; i <= counter; i++) {
Runnable worker = new FinalParallelDataProcessingStrategyOperator(mongoDatabase,vendor,version,importDate,vendorId,i,securitiesId);
executor.execute(worker);
}
Each thread is doing following thing
public void run() {
try {
List<SecurityTemp> temps = loadDataInBatch();
populateToNewCollection(temps);
populateToAnotherCollection(temps);
} catch (IOException e) {
e.printStackTrace();
}
}
Load data is paginated by using following query
mongoDB.getCollection("reading_collection").find(whereClause).
.skip(pagesize*(n-1)).limit(pagesize).batchSize(1000).iterator();
Machine Configuration : 2 CPU with 1 core each
Parallel implementation is giving almost same performance as sequential. Stats on subset of data (319568 records)
No. of Threads Execution Time(minutes)
1 16
3 15
8 17
10 17
15 16
20 12
50 30
How to improve performance of this application?