
I have a hash table containing N hash tables:

Map<Integer, Map<String, Double>>

I need to create a list containing all the keys of the inner maps:

----------------
|     | a    2 |
| 100 | b    1 |    
|     | c    2 |
----------------
|     | a    2 |
| 101 | d    2 |     
---------------- 
|     | a    2 |
| 102 | b    1 |    
|     | e    2 |
----------------

list = {a,b,c,d,e}

Here is my current code:

Set<String> keys = new HashSet<>();
map1.entrySet().forEach(e -> {
    keys.addAll(e.getValue().keySet());
});

map1 contains thousands of entries.

Is this the optimal approach? Does anyone know a faster way?

Mark Rotteveel
Fab
  • Why don't you use `map.entrySet().stream().map(Map.Entry::getKey).collect(Collectors.toList());`? This way you would not need the nested `addAll()` call, nor need to initialize an empty list beforehand. That's of course just cosmetics. BUT: I do not know whether using `.parallel()` on the stream would increase performance. – JDC Dec 19 '16 at 12:22
  • @JDC I'm testing your solution, but `Collectors` cannot be resolved – Fab Dec 19 '16 at 12:33
  • You have to add it to the import statements of your class: `import java.util.stream.Collectors;` – JDC Dec 19 '16 at 12:34
  • @JDC It is already imported since I've already used it – Fab Dec 19 '16 at 12:36
  • @PatrickParker: was not yet an answer, but a question why he is not using it. But I will add it as an answer. – JDC Dec 19 '16 at 12:38
  • @JDC I think it is an Eclipse bug. I retyped the code line and no errors appear. Two equal lines: one OK, one with an error -> MYSTERY – Fab Dec 19 '16 at 12:46
  • @Fab I've made an edit to my code which shows how you can potentially shave a second off the worst case time, if that is your goal. – Patrick Parker Dec 19 '16 at 15:24

3 Answers


You could try to use the following code:

Map<String, Double> innerMap = new HashMap<>();
innerMap.put("a", 2d);
innerMap.put("b", 2d);
innerMap.put("c", 2d);

Map<String, Double> innerMap2 = new HashMap<>();
innerMap2.put("a", 2d);
innerMap2.put("d", 2d);
innerMap2.put("e", 2d);

Map<Integer, Map<String, Double>> map = new HashMap<>();
map.put(100, innerMap);
map.put(101, innerMap2);

Set<String> collect = map.values()
                         .stream()
                         .parallel()
                         .map(Map::keySet)
                         .flatMap(Collection::stream)
                         .collect(Collectors.toSet());

Unfortunately, you will have to test it yourself to see whether it has a significant performance impact.
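One way to test it yourself: a rough, non-JMH timing sketch along these lines can give a first impression on your own data. The map sizes below are made-up placeholders, not taken from the question:

```java
import java.util.*;
import java.util.stream.*;

public class KeyUnionTiming {
    public static void main(String[] args) {
        // Build synthetic data; sizes are arbitrary placeholders.
        Map<Integer, Map<String, Double>> map = new HashMap<>();
        for (int i = 0; i < 2_000; i++) {
            Map<String, Double> inner = new HashMap<>();
            for (int j = 0; j < 100; j++) {
                // i % 100 makes keys overlap across inner maps, as in the question
                inner.put("key" + (i % 100) + "_" + j, 1.0);
            }
            map.put(i, inner);
        }

        long t0 = System.nanoTime();
        Set<String> sequential = map.values().stream()
                .map(Map::keySet)
                .flatMap(Collection::stream)
                .collect(Collectors.toSet());
        long t1 = System.nanoTime();

        Set<String> parallel = map.values().stream()
                .parallel()
                .map(Map::keySet)
                .flatMap(Collection::stream)
                .collect(Collectors.toSet());
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms, parallel: %d ms, distinct keys: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sequential.size());
    }
}
```

Keep in mind that a one-shot timing like this is noisy (no JVM warm-up); use a proper harness such as JMH before drawing conclusions.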


This implies that you are using Java 8; but since you are already using the forEach() method, I assume that you are.


Edit:
Please be aware of the comment by D. Kovács below regarding the usage of the parallel() method.

JDC
  • Don't use parallel, without proper justification. See [this Gist](https://gist.github.com/AFulgens/ba1fec3235cfda1269550fb8e9793db3) for details, TL;DR: NQ >= 10'000 (N: number of datapoints; Q: amount of work) should be satisfied even to begin to think about parallel streams. In this case N == entries in all the maps; Q ≈ 1. – D. Kovács Dec 19 '16 at 12:42
  • @D.Kovács interesting, did not know about that! – JDC Dec 19 '16 at 12:57
  • @SME_Dev You are right, updated the code accordingly – JDC Dec 19 '16 at 12:57
  • @JDC correct me if I'm wrong, but the doc for toSet says "There are no guarantees on the type, mutability, serializability, or thread-safety of the Set returned"... so doesn't that violate your use of `parallel()`? – Patrick Parker Dec 19 '16 at 13:35
  • @PatrickParker indeed, you are correct, see the [Oracle tutorial](https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html) (last footnote at the bottom of the page) – D. Kovács Dec 19 '16 at 13:41
  • @PatrickParker I found the following post that should handle your question: http://stackoverflow.com/questions/22350288/parallel-streams-collectors-and-thread-safety – JDC Dec 19 '16 at 13:52
  • @Kovács Actually, the size of the outer map can vary. In the worst case, I have 20'000 maps, each of them consisting of 300 keys. But I can also have 50 maps consisting of only 1 key. So the `NQ >= 10'000` condition is satisfied only in some cases. Should I avoid using `.parallel()` for this reason? – Fab Dec 19 '16 at 14:41
  • @Fab I would advise avoiding it, see the linked Gist for details. Of course (as everything in Software Engineering), my advice is debatable :) – D. Kovács Dec 19 '16 at 15:48

I think you can take more advantage of parallelism by using flatMap().

You may also wish to consider a ConcurrentHashMap if you can estimate the needed size, i.e. Collections.newSetFromMap(new ConcurrentHashMap())

Example:

    final int EST_SIZE = 6_000_000;
    // Map<Integer,Map<String,Double>> map1;
    Set<String> keys = map1.values().stream().map(Map::keySet)
            .flatMap(Set::stream).parallel().unordered()
            .collect(Collector.of(
                    () -> Collections.newSetFromMap(
                        new ConcurrentHashMap<>(EST_SIZE * 4 / 3 + 1)
                    ),
                    Set::add,
                    (set1,set2) -> { 
                        set1.addAll(set2);
                        return set1;
                    },
                    Collector.Characteristics.CONCURRENT,
                    Collector.Characteristics.UNORDERED
                    ));

Note: the above code is an attempted optimization for a highly specialized scenario. Test it and check performance before deciding. You should prefer something like this in the general case:

    Set<String> keys = map1.values().stream().map(Map::keySet)
            .flatMap(Set::stream).collect(Collectors.toSet());
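As a quick sanity check (using the sample data from the question, built with plain Java 8 HashMaps), the general-case pipeline produces the expected key set:

```java
import java.util.*;
import java.util.stream.*;

public class KeyUnionCheck {
    public static void main(String[] args) {
        // Inner maps as in the question's diagram.
        Map<String, Double> m100 = new HashMap<>();
        m100.put("a", 2.0); m100.put("b", 1.0); m100.put("c", 2.0);
        Map<String, Double> m101 = new HashMap<>();
        m101.put("a", 2.0); m101.put("d", 2.0);
        Map<String, Double> m102 = new HashMap<>();
        m102.put("a", 2.0); m102.put("b", 1.0); m102.put("e", 2.0);

        Map<Integer, Map<String, Double>> map1 = new HashMap<>();
        map1.put(100, m100); map1.put(101, m101); map1.put(102, m102);

        // Union of all inner-map keys.
        Set<String> keys = map1.values().stream().map(Map::keySet)
                .flatMap(Set::stream).collect(Collectors.toSet());
        System.out.println(new TreeSet<>(keys)); // prints [a, b, c, d, e]
    }
}
```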
Patrick Parker
  • Any specific reason to use ConcurrentSkipListSet? It is one of the slowest Set implementations, which should only be used when justified. Also don't use parallel streams, unless justified :) – D. Kovács Dec 19 '16 at 12:57
  • @D.Kovács well, the OP did say "thousands" of maps, so that potentially means thousands x thousands of keys. He also framed his question as being a performance issue with a very large data set, and that is something which can benefit from parallelism. Of course without his data I can't test it to know for sure. For your other question see http://stackoverflow.com/a/6720658/7098259 – Patrick Parker Dec 19 '16 at 13:32
  • fair enough, but even for a couple million entries in total it shouldn't take more than a couple of seconds, w/o parallelism. I have to admit that I dislike parallel streams (see my other comment with a linked Gist about the topic). I just have a feeling that these additions (i.e., parallel + CSkipLSet) are premature optimizations. – D. Kovács Dec 19 '16 at 13:36
  • @D.Kovács my highly unscientific test with 6 million keys showed a significant speedup. ConcurrentHashMap = 1971 (millis), Collectors.toSet = 3354 (millis) – Patrick Parker Dec 19 '16 at 15:22
  • indeed, that's what I would expect. That's 1 second difference. If you do this every couple of seconds, or every minute, then you would be interested in this (parallel comes with a lot of headaches, e.g., because of the common ForkJoinPool). If you do it once a day during lazy time (statistics when there is no server load), then who cares? :) – D. Kovács Dec 19 '16 at 15:42

Essentially, your approach is optimal for the generic case. You can of course tailor it if you have additional constraints on the Map structure, key distribution, etc., but for the generic case I don't see a better method, only different syntax, e.g. using streams all the way:

map.entrySet()
   .stream()
   .map(Map.Entry::getValue)
   .flatMap(e -> e.keySet().stream())
   .collect(Collectors.toSet())
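A small variation on the same pipeline, shown here as a self-contained sketch: starting from values() skips the entrySet()/map(Map.Entry::getValue) step and behaves identically.

```java
import java.util.*;
import java.util.stream.*;

public class ValuesVariant {
    public static void main(String[] args) {
        Map<String, Double> inner = new HashMap<>();
        inner.put("a", 2.0);
        inner.put("b", 1.0);
        Map<Integer, Map<String, Double>> map = new HashMap<>();
        map.put(100, inner);

        // Same result as the entrySet()-based pipeline, one step shorter.
        Set<String> keys = map.values().stream()
                .flatMap(e -> e.keySet().stream())
                .collect(Collectors.toSet());
        System.out.println(new TreeSet<>(keys)); // prints [a, b]
    }
}
```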

D. Kovács