
So I have a list from which I obtain a parallel stream to fill out a map, as follows:

Map<Integer, TreeNode> map = new HashMap<>();
List<NodeData> list = some_filled_list;

// Putting data from the list into the map
list.parallelStream().forEach(d -> {
    TreeNode node = new TreeNode(d);
    map.put(node.getId(), node);
});

// Print out the map
map.entrySet().stream().forEach(entry -> {
    System.out.println("Processing node with ID = " + entry.getValue().getId());
});

The problem with this code is that the map is printed out while the "putting data" process is still going on (because it's parallel), so the map has not yet received all the elements from the list. Of course, in my real code it's not just printing out the map; I use a map to take advantage of its O(1) lookup time.

My question is:

  1. How can I make the main thread wait so that the "putting data" step finishes before the map is printed out? I tried putting the "putting data" code inside a thread t, and calling t.start() and t.join(), but that doesn't help.

  2. Maybe I am not supposed to use a parallel stream in this case? The list is long, and I just want to take advantage of parallelism to improve efficiency.

    Is creating a `TreeNode` instance by means of its constructor `new TreeNode(d)` computationally expensive? I mean if it has heavy CPU load. If the answer is no, then you will gain nothing by using a parallel stream. Parallel streams are useful only to a very small % of applications which perform very expensive CPU bound tasks. Parallel streams are not meant to be used in scenarios where only concurrency is needed (such as calling remote services without sequentially waiting for each service's result). – fps Jan 12 '18 at 18:45

3 Answers


With `list.parallelStream().forEach` you are violating the side-effects property that is explicitly stated in the Stream documentation.

Also, when you say the map is being printed out when the "putting data" process is still going on (because it's parallel), that is not true: `forEach` is a terminal operation, and it will finish before the next line can run. You might be seeing it that way because you are collecting into a non-thread-safe `HashMap`, so some entries might be missing from that map... Think about it the other way: what would happen if you put multiple entries into a `HashMap` from multiple threads? Well, lots of things can break, like missing entries, an incorrect/inconsistent `Map`, etc.

Of course, changing that to a `ConcurrentHashMap` would work, since it's thread-safe, but you would still be violating the side-effects property, although in a "safe" way.
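For completeness, here is a self-contained sketch of that "safe but still side-effecting" variant. The `NodeData` and `TreeNode` records are hypothetical stand-ins, since the question does not show those classes:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ConcurrentSideEffectDemo {
    // Hypothetical stand-ins for the question's NodeData / TreeNode classes
    record NodeData(int id) {}
    record TreeNode(NodeData data) {
        int getId() { return data.id(); }
    }

    public static void main(String[] args) {
        List<NodeData> list = IntStream.range(0, 10_000)
                .mapToObj(NodeData::new)
                .collect(Collectors.toList());

        // Thread-safe map, but the forEach is still a side-effecting operation
        Map<Integer, TreeNode> map = new ConcurrentHashMap<>();
        list.parallelStream().forEach(d -> {
            TreeNode node = new TreeNode(d);
            map.put(node.getId(), node);
        });

        // All 10_000 entries are present, because ConcurrentHashMap.put is atomic
        System.out.println(map.size());
    }
}
```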

The correct thing to do is to collect to a Map directly without forEach:

Map<Integer, TreeNode> map = list.parallelStream()
        .collect(Collectors.toMap(
                NodeData::getId,
                TreeNode::new
        ));

This way, even for parallel processing, everything will be fine. Just note that you would need lots of elements (tens of thousands) to see any measurable performance increase from parallel processing.
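If you do stay parallel, `Collectors.toConcurrentMap` is worth knowing about: it accumulates into a single shared concurrent map instead of building per-thread maps and merging them. A minimal sketch, again assuming hypothetical `NodeData` / `TreeNode` records in place of the question's classes:

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ToConcurrentMapDemo {
    // Hypothetical stand-ins for the question's NodeData / TreeNode classes
    record NodeData(int id) {}
    record TreeNode(NodeData data) {
        int getId() { return data.id(); }
    }

    public static void main(String[] args) {
        List<NodeData> list = IntStream.range(0, 50_000)
                .mapToObj(NodeData::new)
                .collect(Collectors.toList());

        // toConcurrentMap inserts into one shared ConcurrentHashMap,
        // so parallel threads do not pay per-thread map merging costs
        ConcurrentMap<Integer, TreeNode> map = list.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        NodeData::id,
                        TreeNode::new
                ));

        System.out.println(map.size());
    }
}
```

The trade-off is that `toConcurrentMap` does not preserve encounter order, which is irrelevant for a `HashMap`-style lookup table like this one.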

  • Thumbs up for mentioning that parallel streams are not the holy grail for anything – Lino Jan 12 '18 at 13:08
  • One of the most interesting examples of inconsistent `HashMap`s that actually happens in real life is running into an infinite loop, with the nodes apparently linked circularly in a way that would never be possible when looking at the code with a mindset of sequential execution. Oh and, `list.parallelStream().collect(Collectors.toMap(NodeData::getId, TreeNode::new))` is one of the scenarios where it is almost impossible to gain from parallelism, regardless of the number of elements, as the merging costs are on par with any potential saving. – Holger Jan 12 '18 at 14:02
  • @Holger this one is on my looong list of someday I will find time. indeed merging two maps is pretty expensive... – Eugene Jan 12 '18 at 14:10
  • @Eugene well, then, if you find the time, [this one](https://stackoverflow.com/a/41045442/2711488) might be a good read on this (keeping in mind that `toMap(…)` without merge function does not allow duplicate keys). – Holger Jan 12 '18 at 14:19

Stream operations will block until done, for both parallel and sequential implementations.

So what you see is not the "putting data" process still going on; most likely it is just data corruption, since `HashMap` is not thread-safe. Try using `ConcurrentHashMap` instead.

  • Using ConcurrentHashMap did help. Just tested. However, it is still not clear to me why things work properly when I use a sequential stream instead of a parallel one. What is corrupted in my regular HashMap when I use a parallel stream, and why is it not corrupted with a sequential stream? – Simo Jan 12 '18 at 06:58
  • Basically, the `put` method reads `HashMap`'s table, adds an element to it, then writes it back to memory. In a serial (sequential) stream that works just great, but in a parallel world two threads can read the same value, modify it independently, and write the result back to memory. The last write will not contain the changes from the other thread, so they get overwritten. – Stadnyk Oleksii Jan 12 '18 at 08:19
  • That is quite a messy explanation, so you'd probably be better off reading an article or two on the thread-safety topic. Something like this should clarify it for you a little: https://www.journaldev.com/1061/thread-safety-in-java – Stadnyk Oleksii Jan 12 '18 at 08:26
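The contrast the comments describe can be shown in a small self-contained sketch (using plain `Integer` keys instead of the question's `NodeData`, which is not shown). Both of these paths are safe; the dangerous combination, parallel `forEach` into a plain `HashMap`, is deliberately left out because it can lose entries or even corrupt the table:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class ThreadSafetyDemo {
    public static void main(String[] args) {
        // Sequential stream into a plain HashMap: safe, only one thread writes
        Map<Integer, Integer> plain = new HashMap<>();
        IntStream.range(0, 10_000).forEach(i -> plain.put(i, i));

        // Parallel stream into a ConcurrentHashMap: safe, put() is atomic
        Map<Integer, Integer> concurrent = new ConcurrentHashMap<>();
        IntStream.range(0, 10_000).parallel().forEach(i -> concurrent.put(i, i));

        // A parallel stream writing into the plain HashMap (not shown) may
        // silently drop entries or corrupt the internal table
        System.out.println(plain.size() + " " + concurrent.size());
    }
}
```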

I would guess that, if it were possible for the stream to still be processing, you could try something like:

    List<NodeData> list = new ArrayList<>();

    //Putting data from the list into the map
    Map<Integer, TreeNode> map = list.parallelStream()
            .collect(Collectors.toMap(
                    n -> n.getId(),
                    n -> new TreeNode(n)
            ));

At least now you have a terminal operation on the stream. You will use multiple threads where possible, and the mapping is certain to be complete.
