
I am facing a scalability issue while reading data from a HashMap. My machine has 32 cores with 2 hyper-threads per core (64 logical CPUs in total) and 64 GB of RAM. When reading data from the HashMap and doing arithmetic calculations I see a performance dip from 16 threads onwards, but when performing only the arithmetic operations the program scales as expected.

Please find the test results below:

Reading from the HashMap and performing arithmetic operations:

no. of threads | time taken (seconds)
             1 | 85
             2 | 93
             4 | 124
             8 | 147
            16 | 644

Performing only arithmetic operations:

no. of threads | time taken (seconds)
             1 | 25
             2 | 32
             4 | 35
             8 | 41
            16 | 65
            32 | 108
            40 | 112
            64 | 117
           100 | 158

The code is included below for reference:

import java.util.*;
import java.util.concurrent.*;

public class StringCallable2
{
    private static final long size = 1000000L;
//  private static final HashMap<Long, Long> map = new HashMap<Long, Long>();
//  private static long[] array = new long[(int) size];

    public static class StringGenCallable implements Callable<Long>
    {
        int count;

        public StringGenCallable(int count)
        {
            this.count = count;
        }

        public Long call()
        {
            long sum = 20;
            // CPU-intensive arithmetic only; no I/O, no object creation,
            // no floating-point arithmetic
            for (long i = 0; i < size; i++)
            {
                // Long long1 = map.get((long) i);  // HashMap-read variant
                // Long long1 = array[(int) i];     // array-read variant
                sum = i + 19 * sum;
            }
            return sum;
        }
    }

    public static void main(String[] args)
    {
        try
        {
            System.out.println("Starting");
            // Population used for the HashMap/array variants:
            // for (long i = 0; i < size; i++)
            // {
            //     array[(int) i] = System.currentTimeMillis();
            //     map.put(i, System.currentTimeMillis());
            // }
            int numThreads = Integer.valueOf(args[0]);
            long startTime = System.currentTimeMillis();

            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            Set<Future<Long>> set = new HashSet<Future<Long>>();
            for (int i = 0; i < numThreads; i++)
            {
                Callable<Long> callable = new StringGenCallable(i);
                Future<Long> future = pool.submit(callable);
                set.add(future);
            }

            for (Future<Long> future : set)
            {
                future.get();
            }

            System.out.println("Number of threads : " + numThreads);
            long finishTime = System.currentTimeMillis();
            System.out.println("Total Time Taken : " + (finishTime - startTime) + " ms");
            pool.shutdown();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Rahul
3 Answers


For an application with this level of multiprocessing you should be using ConcurrentHashMap. I would redesign to incorporate that change, and then revisit the performance.
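A minimal sketch of that substitution against the posted code (the class name ConcurrentMapVariant and the single-threaded driver are mine; only the map type changes relative to the question):

import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentMapVariant
{
    private static final long size = 1000000L;

    // ConcurrentHashMap instead of HashMap: reads never block, and writes
    // only contend per bucket, so many threads can look up keys safely.
    private static final ConcurrentHashMap<Long, Long> map =
            new ConcurrentHashMap<Long, Long>();

    public static void main(String[] args)
    {
        for (long i = 0; i < size; i++)
        {
            map.put(i, System.currentTimeMillis());
        }

        long sum = 20;
        for (long i = 0; i < size; i++)
        {
            // same arithmetic as the question, but reading the map value too
            sum = i + 19 * sum + map.get(i);
        }
        System.out.println(sum);
    }
}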

I would also think carefully about how many threads you can effectively use. It's easy to view 'add more threads' as a performance panacea, and it's not. You may get more improvement by limiting the thread count and making currently-shared data structures into ThreadLocal, to reduce data sharing and the resulting contention and context switching.
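As a rough illustration of the ThreadLocal idea (the class name, toy data and sizes are mine, not from the answer), each thread gets its own private copy of the lookup table, so reads never touch state shared with other threads:

import java.util.HashMap;
import java.util.Map;

public class ThreadLocalMapExample
{
    // Each thread lazily builds its own private copy of the lookup table.
    private static final ThreadLocal<Map<Long, Long>> localMap =
            new ThreadLocal<Map<Long, Long>>()
            {
                @Override
                protected Map<Long, Long> initialValue()
                {
                    Map<Long, Long> m = new HashMap<Long, Long>();
                    for (long i = 0; i < 1000; i++)
                    {
                        m.put(i, i * 2);
                    }
                    return m;
                }
            };

    public static void main(String[] args) throws InterruptedException
    {
        Runnable task = new Runnable()
        {
            public void run()
            {
                Map<Long, Long> m = localMap.get(); // this thread's own copy
                long sum = 0;
                for (long i = 0; i < 1000; i++)
                {
                    sum += m.get(i);
                }
                System.out.println(Thread.currentThread().getName() + " -> " + sum);
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}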

In this example, even assuming you own the entire box for this process, having more than 64 threads will make the process run increasingly slowly, since the work items are purely CPU-bound.
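One way to apply that cap to the question's driver (a sketch; the class name and the Math.min cap are my additions):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing
{
    public static void main(String[] args)
    {
        int requested = Integer.parseInt(args[0]);
        // For purely CPU-bound work, a pool larger than the number of hardware
        // threads only adds scheduling overhead, so cap it there.
        int cpus = Runtime.getRuntime().availableProcessors(); // 64 on the machine in the question
        int poolSize = Math.min(requested, cpus);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        System.out.println("Using " + poolSize + " threads");
        pool.shutdown();
    }
}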

In a real-world application, the unit of work would likely be far more complicated or long-running than what you have here, so be cautious about drawing too many conclusions from what is, for your hardware, a fairly trivial per-thread unit of work. The point is that relative to a more complex workload, the thread-management overhead here is amplified compared to the work actually executed. In a more complex workload, the visible cost of the HashMap lookup would tend to disappear and performance would look more like what you expect.

Steve Townsend

Judging from your commented-out code, the high overhead is likely from autoboxing. Each call to map.get((long) i) is likely to allocate a new Long object. Allocation is fast, but not that fast.

This applies whether you have one thread or many. However, with many threads it is likely that memory bandwidth matters more than CPU.

(Implementations of Long.valueOf are allowed to return the same Long instance for the same value, which typically happens only for small values. Escape analysis could also keep the Long off the heap entirely.)
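A small standalone snippet illustrating that cache (the class name is mine): Long.valueOf is only required to cache the -128 to 127 range, so the keys the benchmark looks up are mostly freshly allocated objects.

public class BoxingExample
{
    public static void main(String[] args)
    {
        // Small values are served from the Long.valueOf cache...
        Long a = Long.valueOf(100L);
        Long b = Long.valueOf(100L);
        System.out.println(a == b); // true: same cached instance

        // ...but larger values typically allocate a fresh Long each time,
        // which is what happens for most keys in map.get((long) i).
        Long c = Long.valueOf(100000L);
        Long d = Long.valueOf(100000L);
        System.out.println(c == d); // false on standard JDKs: two distinct objects
    }
}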

Tom Hawtin - tackline

At first I suspected this was because the HashMap case creates an object on every lookup.

However, after testing (see below), I believe the problem is that it becomes increasingly difficult to get efficient access to the CPU caches.

import gnu.trove.TLongLongHashMap;

import java.util.HashMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * @author peter.lawrey
 */
public class HashMapPerfMain {
    public static final int REPEATS = 10000;

    public static void main(String... args) throws InterruptedException {
        int runLength = 10 * 1000;
        HashMap<Long, Long> hashMap = new HashMap<Long, Long>();
        TLongLongHashMap troveMap = new TLongLongHashMap();
        long[] array = new long[runLength];
        for (long i = 0; i < runLength; i++) {
            long now = System.nanoTime();
            hashMap.put(i, now);
            troveMap.put(i, now);
            array[((int) i)] = now;
        }

        for (int i = 0; i < 3; i++) {
            timeHashMap(hashMap);
            timeTroveMap(troveMap);
            timeArray(array);
        }
    }

    private static void timeHashMap(final HashMap<Long, Long> map) throws InterruptedException {
        System.out.printf("%-16s ", map.getClass().getSimpleName());
        for (int t = 1; t <= Runtime.getRuntime().availableProcessors(); t *= 2) {
            long start = System.nanoTime();
            ExecutorService es = Executors.newFixedThreadPool(t);
            for (int i = 0; i < t * REPEATS; i++)
                es.submit(new Callable<Long>() {
                    @Override
                    public Long call() throws Exception {
                        long sum = 20;
                        for (long key = 0; key < map.size(); key++)
                            sum = sum * 19 + map.get(key);
                        return sum;
                    }
                });
            es.shutdown();
            es.awaitTermination(10, TimeUnit.MINUTES);
            long time = System.nanoTime() - start;
            System.out.printf("%d | %.3f ", t, time / 1e9);
        }
        System.out.println();
    }

    private static void timeTroveMap(final TLongLongHashMap map) throws InterruptedException {
        System.out.printf("%-16s ", map.getClass().getSimpleName());
        for (int t = 1; t <= Runtime.getRuntime().availableProcessors(); t *= 2) {
            long start = System.nanoTime();
            ExecutorService es = Executors.newFixedThreadPool(t);
            for (int i = 0; i < t * REPEATS; i++)
                es.submit(new Callable<Long>() {
                    @Override
                    public Long call() throws Exception {
                        long sum = 20;
                        for (long key = 0; key < map.size(); key++)
                            sum = sum * 19 + map.get(key);
                        return sum;
                    }
                });
            es.shutdown();
            es.awaitTermination(10, TimeUnit.MINUTES);
            long time = System.nanoTime() - start;
            System.out.printf("%d | %.3f ", t, time / 1e9);
        }
        System.out.println();
    }

    private static void timeArray(final long[] array) throws InterruptedException {
        System.out.printf("%-16s ", array.getClass().getSimpleName());
        for (int t = 1; t <= Runtime.getRuntime().availableProcessors(); t *= 2) {
            long start = System.nanoTime();
            ExecutorService es = Executors.newFixedThreadPool(t);
            for (int i = 0; i < t * REPEATS; i++)
                es.submit(new Callable<Long>() {
                    @Override
                    public Long call() throws Exception {
                        long sum = 20;
                        for (int key = 0; key < array.length; key++)
                            sum = sum * 19 + array[key];
                        return sum;
                    }
                });
            es.shutdown();
            es.awaitTermination(10, TimeUnit.MINUTES);
            long time = System.nanoTime() - start;
            System.out.printf("%d | %.3f ", t, time / 1e9);
        }
        System.out.println();
    }
}

prints

HashMap          1 | 0.904 2 | 0.863 4 | 0.913 8 | 1.832 
TLongLongHashMap 1 | 0.568 2 | 0.566 4 | 0.572 8 | 1.048 
long[]           1 | 0.092 2 | 0.091 4 | 0.090 8 | 0.093 
HashMap          1 | 0.767 2 | 0.773 4 | 0.912 8 | 1.833 
TLongLongHashMap 1 | 0.560 2 | 0.563 4 | 0.570 8 | 1.057 
long[]           1 | 0.088 2 | 0.089 4 | 0.090 8 | 0.096 
HashMap          1 | 0.758 2 | 0.774 4 | 0.911 8 | 1.828 
TLongLongHashMap 1 | 0.565 2 | 0.564 4 | 0.568 8 | 1.056 
long[]           1 | 0.088 2 | 0.089 4 | 0.090 8 | 0.093 

Array access is very efficient because it scans memory linearly. HashMaps tend to have their data arranged pseudo-randomly in memory, which puts a much greater burden on the caches.

Peter Lawrey