21

I have a list of 1 million objects, and I need to populate it into a Map. I want to reduce the time this population takes, and for that I am planning on using Java 8 parallelStream(), like this:

List<Person> list = new LinkedList<>();
Map<String, String> map = new HashMap<>();
list.parallelStream().forEach(person ->{
    map.put(person.getName(), person.getAge());
});

I want to ask: is it safe to populate a Map like this from parallel threads? Isn't it possible to run into concurrency issues, so that some data gets lost from the Map?

OneMoreError
  • The HashMap is not thread-safe, and writing to the same Map from multiple threads will probably produce concurrency issues. You should use ConcurrentHashMap to do that, I think. – Cédric O. Oct 25 '16 at 10:57
  • Here's a nice [piece](http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html) about the dangers of using `HashMap` in parallel. Obviously the race condition described in there might not exist in later versions of Java, but the general message is still valid. – biziclop Oct 25 '16 at 10:58
  • Race conditions aside, on the whole this doesn't strike me as something that would benefit from parallelisation. How long does it take to populate the map sequentially (I imagine you've already measured that?) and how fast do you need it to be? – NPE Oct 25 '16 at 11:00
  • Same issue as with `ArrayList`: http://stackoverflow.com/q/39453436/1743880 – Tunaki Oct 25 '16 at 11:01
  • @NPE It is currently taking 8 minutes for 3 million records. I want to get it down to 2-3 minutes at least. – OneMoreError Oct 25 '16 at 11:04
  • Have a look at Collectors.toConcurrentMap (or groupingByConcurrent). – GPI Oct 25 '16 at 11:07
  • 8 minutes to iterate over 3 million objects and stick them into a hashmap? I don't know what sort of hardware the code is running on, but this does seem pretty slow. May I suggest that you double-check that you're timing the right thing? – NPE Oct 25 '16 at 11:16
  • Use `list.parallelStream().collect(Collectors.toMap(Person::getName, Person::getAge))`. –  Oct 25 '16 at 11:20
  • Your bottleneck is probably somewhere else, possibly in producing the input data, or in your `equals` or `hashCode` methods. The only possible explanation where the map operation itself is the bottleneck would be a maliciously bad hashCode (always returning 1, for example). Once again, it wouldn't be `HashMap`'s fault. – Marko Topolnik Oct 25 '16 at 11:35
  • If it's taking 8 minutes, it's not adding the entries to a map that is taking the time. That should only take a second, so you should concentrate on the code that is actually taking the time and speed that up. – Peter Lawrey Oct 25 '16 at 11:55
  • http://ideone.com/dEoUhq took 0.6 seconds to add 3 million entries single threaded (a rough sketch of that kind of check follows after these comments). – Peter Lawrey Oct 25 '16 at 11:59
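As a rough illustration of the single-threaded sanity check Peter Lawrey describes (not a rigorous benchmark; the synthetic keys and values are made up here, and timings will vary by machine and JVM):

import java.util.HashMap;
import java.util.Map;

public class SingleThreadedFillTiming {
    public static void main(String[] args) {
        int entries = 3_000_000;
        // Pre-size the map to avoid rehashing while filling it.
        Map<String, String> map = new HashMap<>(entries * 2);

        long start = System.nanoTime();
        for (int i = 0; i < entries; i++) {
            map.put("name" + i, "age" + i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(map.size() + " entries in " + elapsedMs + " ms");
    }
}

If plain puts finish in about a second, the 8 minutes reported above are almost certainly being spent elsewhere in the pipeline, which is the point the commenters are making.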

2 Answers

28

It is perfectly safe to use parallelStream() to collect into a HashMap. However, it is not safe to use parallelStream() with forEach and a consumer that adds entries to a HashMap.

HashMap is not a synchronized class, and trying to put elements into it concurrently will not work properly. That is exactly what forEach does here: it invokes the given consumer, which puts elements into the HashMap, from multiple threads, possibly at the same time. If you want a simple piece of code demonstrating the issue:

List<Integer> list = IntStream.range(0, 10000).boxed().collect(Collectors.toList());
Map<Integer, Integer> map = new HashMap<>();
list.parallelStream().forEach(i -> {
    map.put(i, i);
});
System.out.println(list.size());
System.out.println(map.size());

Make sure to run it a couple of times. There's a very good chance (the joy of concurrency) that the printed map size after the operation is not 10000, which is the size of the list, but slightly less.

The solution here, as always, is not to use forEach but a mutable reduction approach with the collect method and the built-in toMap collector:

Map<Integer, Integer> map = list.parallelStream().collect(Collectors.toMap(i -> i, i -> i));

Use that line of code in the sample above, and you can rest assured that the map size will always be 10000. The Stream API ensures that it is safe to collect into a non-thread-safe container, even in parallel. That also means you don't need toConcurrentMap to be safe: that collector is only needed when you specifically want a ConcurrentMap as the result rather than a general Map. As far as thread safety of collect is concerned, you can use either.
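If you do specifically need a ConcurrentMap as the result, a minimal sketch of the toConcurrentMap variant of the same example could look like this (same 10000-element list as above):

import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ToConcurrentMapSketch {
    public static void main(String[] args) {
        List<Integer> list = IntStream.range(0, 10000).boxed().collect(Collectors.toList());

        // toConcurrentMap inserts into one shared ConcurrentMap instead of merging
        // per-thread maps; both collectors are safe, this one just changes the result type.
        ConcurrentMap<Integer, Integer> map = list.parallelStream()
                .collect(Collectors.toConcurrentMap(i -> i, i -> i));

        System.out.println(map.size()); // always 10000
    }
}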

Tunaki
  • Yeah, there's a very good chance that the printed map size is not 10000, but there’s also a good chance of running into an infinite loop or getting a spurious exception. The worst thing that can happen is that it could appear to do the right thing for this particular test run… – Holger Oct 25 '16 at 13:48
9

HashMap isn't thread-safe, but ConcurrentHashMap is; use that instead:

Map<String, String> map = new ConcurrentHashMap<>();

and your code will work as expected.
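For reference, a self-contained sketch of this approach; the Person class below is just a stand-in assumed from the question (the real class and its getter return types may differ):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ForEachIntoConcurrentMap {

    // Stand-in for the question's Person class; name/age getters assumed.
    static class Person {
        private final String name;
        private final String age;

        Person(String name, String age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        String getAge()  { return age; }
    }

    public static void main(String[] args) {
        List<Person> list = Arrays.asList(new Person("Alice", "30"), new Person("Bob", "25"));

        // ConcurrentHashMap tolerates concurrent put() calls, so the forEach from the
        // question becomes safe once the map implementation is swapped.
        Map<String, String> map = new ConcurrentHashMap<>();
        list.parallelStream().forEach(person -> map.put(person.getName(), person.getAge()));

        System.out.println(map); // {Alice=30, Bob=25} in some order
    }
}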


Performance comparison of forEach() vs toMap()

After JVM warm-up, with 1M elements, using parallel streams and using median timings, the forEach() version was consistently 2-3 times faster than the toMap() version.

Results were consistent between all-unique, 25% duplicate and 100% duplicate inputs.

Bohemian
  • Although your answer is factually correct, it is not the recommended approach. – assylias Oct 25 '16 at 11:59
  • @assylias well the question was *is it safe to populate a Map like this*. Do you happen to know how this performs compared to parallel stream reduction to a map? – Bohemian Oct 25 '16 at 12:01
  • no I don't but I don't see how your suggestion could be better - at best it performs similarly because it is the approach that the JDK engineers have chosen because it performs best. More likely they have found a solution that incurs less lock collision. Would be interesting to test though. – assylias Oct 25 '16 at 12:04
  • @assylias performance results added, in short, `forEach()` 2-3 times faster than reduction. – Bohemian Oct 25 '16 at 13:38
  • When you post performance comparisons, you should also post *what* you are comparing. Most notably, a vanilla `toMap` used with “25% duplicate” input would fail with an exception, rather than producing a comparable result. This suggests that you used an unspecified merge function, which obviously is not what the `forEach` approach does. Besides, there’s a fundamental difference between `toMap` and `toConcurrentMap`… – Holger Oct 25 '16 at 13:53
  • @Holger I used `(a, b) -> b` as the merge function, which has the identical effect and IMHO is the fairest (a small illustration of such a merge function follows after these comments). – Bohemian Oct 25 '16 at 13:56
  • @Bohemian: still, this implies using `Map.merge` rather than `Map.put`, which makes a difference. Further, `forEach` is an unordered operation, so you may use `.unordered().collect(toMap(…))` to achieve a similar effect. But, as said, `toMap` is a fundamentally different operation than `toConcurrentMap`. If you don’t have any performance-relevant upstream operations, but still want parallel operations, `toConcurrentMap` is the better choice (much like the `forEach` to `ConcurrentMap` approach), though it’s very likely that a single threaded operation would be even more efficient in this case. – Holger Oct 25 '16 at 13:59
  • @assylias: `toMap`, similarly to `toList`, never benefits from parallel operations, as the merging cost is as high as any potential benefit from the previous parallel processing. They are only useful in parallel streams when the upstream operations benefit from contention-free parallel processing. For benchmarks solely consisting of the collect operation, without any useful stream processing, they will always fail. – Holger Oct 25 '16 at 14:03
  • Though it’s hard to believe that the same result arose with “100% duplicate” input, as that implies 100% contention for the `ConcurrentMap`, but 0% merging costs for the `toMap` operation. That’s implausible, but as said, any posted benchmark result without showing the actual code under test is worthless… – Holger Oct 25 '16 at 14:11
  • @Holger "100% duplicates" means every element of the 1M was duplicated ONCE, i.e. 500K distinct elements. I'll post the code a little later and let you know when I do – Bohemian Oct 25 '16 at 15:03
  • @Bohemian Make sure you include the simple for loop variant in your benchmark, as a baseline, because in the benchmark I made with an "empty" pipeline (just collecting a list of integers into a map), the for loop was faster than all other solutions combined (and indeed, the `forEach` + `ConcurrentMap` was also a bit faster than `collect(toMap())`). This really depends on the operations that happen before collecting. – Tunaki Oct 25 '16 at 17:12
  • Because I made a second benchmark with a simple method before the collecting doing some work (String manipulation on the input integer and such, trying to fool the JIT, perhaps failed, but eh), and `collect(toMap())` subsequently became faster than using the `forEach` approach. In any case, I think it's fair to say that without having the exact full pipeline to test with, it's not really conclusive. (Ran all that with JDK 1.8.0_102 on a recent Windows 10 x64.) – Tunaki Oct 25 '16 at 17:45
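To make the duplicate-key point above concrete: a plain toMap(keyMapper, valueMapper) throws IllegalStateException on duplicate keys, so a benchmark over input with duplicates has to supply a merge function such as `(a, b) -> b`, whereas forEach with put simply overwrites. A minimal illustration (this is not the benchmark code discussed above):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToMapMergeFunctionSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");

        // Without the third argument this would throw IllegalStateException,
        // because "apple" and "avocado" both map to the key 'a'.
        Map<Character, String> lastWins = words.parallelStream()
                .collect(Collectors.toMap(
                        w -> w.charAt(0),  // key: first letter (duplicate for 'a')
                        w -> w,            // value: the word itself
                        (a, b) -> b));     // merge function: keep the later value

        System.out.println(lastWins); // {a=avocado, b=banana}
    }
}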