Without seeing your code, the most probable cause of these poor performance results is garbage collection activity. To demonstrate it, I wrote the following program:
import java.lang.management.ManagementFactory;
import java.util.*;
import java.util.concurrent.*;

public class TestMap {
  // we assume NB_ENTITIES is divisible by NB_TASKS
  static final int NB_ENTITIES = 20_000_000, NB_TASKS = 2;
  static Map<String, String> map = new ConcurrentHashMap<>();

  public static void main(String[] args) {
    try {
      System.out.printf("running with nb entities = %,d, nb tasks = %,d, VM args = %s%n",
        NB_ENTITIES, NB_TASKS, ManagementFactory.getRuntimeMXBean().getInputArguments());
      ExecutorService executor = Executors.newFixedThreadPool(NB_TASKS);
      int entitiesPerTask = NB_ENTITIES / NB_TASKS;
      List<Future<?>> futures = new ArrayList<>(NB_TASKS);
      long startTime = System.nanoTime();
      for (int i = 0; i < NB_TASKS; i++) {
        MyTask task = new MyTask(i * entitiesPerTask, (i + 1) * entitiesPerTask - 1);
        futures.add(executor.submit(task));
      }
      for (Future<?> f : futures) {
        f.get();
      }
      long elapsed = System.nanoTime() - startTime;
      executor.shutdownNow();
      System.gc();
      Runtime rt = Runtime.getRuntime();
      long usedMemory = rt.maxMemory() - rt.freeMemory();
      System.out.printf("processing completed in %,d ms, usedMemory after GC = %,d bytes%n",
        elapsed / 1_000_000L, usedMemory);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  static class MyTask implements Runnable {
    private final int startIdx, endIdx;

    public MyTask(final int startIdx, final int endIdx) {
      this.startIdx = startIdx;
      this.endIdx = endIdx;
    }

    @Override
    public void run() {
      long startTime = System.nanoTime();
      for (int i = startIdx; i <= endIdx; i++) {
        map.put("sambit:rout:" + i, "C:\\Images\\Provision_Images");
      }
      long elapsed = System.nanoTime() - startTime;
      System.out.printf("task[%,d - %,d], completed in %,d ms%n",
        startIdx, endIdx, elapsed / 1_000_000L);
    }
  }
}
At the end of the processing, this code computes an approximation of the used memory by calling System.gc()
immediately followed by Runtime.maxMemory() - Runtime.freeMemory()
. This shows that the map with 20 million entries takes approximately 2.2 GiB (about 2.38 billion bytes), which is considerable. I ran it with 1 and 2 threads, for various values of the -Xmx and -Xms JVM arguments; here are the resulting outputs (just to be clear: 2560m = 2.5g):
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms2560m, -Xmx2560m]
task[0 - 19,999,999], completed in 11,781 ms
processing completed in 11,782 ms, usedMemory after GC = 2,379,068,760 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms2560m, -Xmx2560m]
task[0 - 9,999,999], completed in 8,269 ms
task[10,000,000 - 19,999,999], completed in 12,385 ms
processing completed in 12,386 ms, usedMemory after GC = 2,379,069,480 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms3g, -Xmx3g]
task[0 - 19,999,999], completed in 12,525 ms
processing completed in 12,527 ms, usedMemory after GC = 2,398,339,944 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms3g, -Xmx3g]
task[0 - 9,999,999], completed in 12,220 ms
task[10,000,000 - 19,999,999], completed in 12,264 ms
processing completed in 12,265 ms, usedMemory after GC = 2,382,777,776 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms4g, -Xmx4g]
task[0 - 19,999,999], completed in 7,363 ms
processing completed in 7,364 ms, usedMemory after GC = 2,402,467,040 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms4g, -Xmx4g]
task[0 - 9,999,999], completed in 5,466 ms
task[10,000,000 - 19,999,999], completed in 5,511 ms
processing completed in 5,512 ms, usedMemory after GC = 2,381,821,576 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms8g, -Xmx8g]
task[0 - 19,999,999], completed in 7,778 ms
processing completed in 7,779 ms, usedMemory after GC = 2,438,159,312 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms8g, -Xmx8g]
task[0 - 9,999,999], completed in 5,739 ms
task[10,000,000 - 19,999,999], completed in 5,784 ms
processing completed in 5,785 ms, usedMemory after GC = 2,396,478,680 bytes
These results can be summarized in the following table:
--------------------------------
heap | exec time (ms) for:
size (gb) | 1 thread | 2 threads
--------------------------------
2.5 | 11782 | 12386
3.0 | 12527 | 12265
4.0 | 7364 | 5512
8.0 | 7779 | 5785
--------------------------------
I also observed that, for the 2.5g and 3g heap sizes, there was high CPU activity, with spikes at 100% throughout the processing time due to GC activity, whereas for 4g and 8g it was observed only at the end, due to the System.gc()
call.
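If you want to quantify that GC overhead without attaching a profiler, the JVM exposes per-collector statistics via GarbageCollectorMXBean. Here is a minimal sketch (the class name GcStats is my own); it sums the collection counts and cumulative collection times across all collectors, so you can snapshot them before and after a workload:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
  // Sums collection counts and cumulative collection times (in ms)
  // across all garbage collectors of the running JVM.
  static long[] totalGcCountAndTime() {
    long count = 0, timeMs = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      // Both methods may return -1 when the value is unavailable.
      count += Math.max(0, gc.getCollectionCount());
      timeMs += Math.max(0, gc.getCollectionTime());
    }
    return new long[] { count, timeMs };
  }

  public static void main(String[] args) {
    long[] before = totalGcCountAndTime();
    // ... run the workload to be measured here ...
    long[] after = totalGcCountAndTime();
    System.out.printf("GC runs = %,d, GC time = %,d ms%n",
      after[0] - before[0], after[1] - before[1]);
  }
}
```

Alternatively, running with -verbose:gc (or -Xlog:gc on JDK 9+) prints each collection as it happens.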
To conclude:

- if your heap is sized inappropriately, garbage collection will kill any performance gain you might hope to obtain. You should make it large enough to avoid the side effects of long GC pauses.

- you must also be aware that using a concurrent collection such as ConcurrentHashMap
has a significant performance overhead. To illustrate this, I slightly modified the code so that each task uses its own HashMap
, then at the end all the maps are aggregated (with Map.putAll()
) into the map of the first task. The processing time fell to around 3,200 ms.
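For reference, here is a sketch of that variant. The per-thread HashMap plus final Map.putAll() aggregation is the idea described above; the class name, the use of Callable to return each partial map, and the reduced entity count (so it runs in a default-sized heap) are my own choices:

```java
import java.util.*;
import java.util.concurrent.*;

public class TestLocalMaps {
  // Reduced from 20 million so this runs comfortably in a default-sized heap.
  static final int NB_ENTITIES = 1_000_000, NB_TASKS = 2;

  // Each task fills its own plain HashMap: no synchronization during insertion.
  static class MyTask implements Callable<Map<String, String>> {
    private final int startIdx, endIdx;

    MyTask(int startIdx, int endIdx) {
      this.startIdx = startIdx;
      this.endIdx = endIdx;
    }

    @Override
    public Map<String, String> call() {
      int size = endIdx - startIdx + 1;
      // Pre-size to avoid rehashing (default load factor is 0.75).
      Map<String, String> local = new HashMap<>(size * 4 / 3 + 1);
      for (int i = startIdx; i <= endIdx; i++) {
        local.put("sambit:rout:" + i, "C:\\Images\\Provision_Images");
      }
      return local;
    }
  }

  static Map<String, String> buildMerged() throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(NB_TASKS);
    int perTask = NB_ENTITIES / NB_TASKS;
    List<Future<Map<String, String>>> futures = new ArrayList<>(NB_TASKS);
    for (int i = 0; i < NB_TASKS; i++) {
      futures.add(executor.submit(new MyTask(i * perTask, (i + 1) * perTask - 1)));
    }
    // Aggregate all partial maps into the first task's map with Map.putAll().
    Map<String, String> merged = futures.get(0).get();
    for (int i = 1; i < NB_TASKS; i++) {
      merged.putAll(futures.get(i).get());
    }
    executor.shutdownNow();
    return merged;
  }

  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    Map<String, String> merged = buildMerged();
    long elapsed = System.nanoTime() - start;
    System.out.printf("merged %,d entries in %,d ms%n", merged.size(), elapsed / 1_000_000L);
  }
}
```

Note that this trades peak memory for speed: while merging, the partial maps and the merged map coexist, so the approach only pays off if the heap has room for both.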