
I have 10 sets as input, where each set contains hundreds of items (strings).

What I want:

I want to find items that are shared by at least two sets and sort them by the number of occurrences across the different sets, in descending order.

My approach:

I've created the following code. However, I want to know if there is a more efficient way to do this...

Map<String, Integer> sharedCounts = new HashMap<>();

// Count in how many of the 10 sets each word appears
// (a word can occur at most once per set, since getWords(i) returns a Set).
for (int i = 0; i < 10; i++) {
  Set<String> words = getWords(i);
  for (String word : words) {
    if (sharedCounts.containsKey(word)) {
      sharedCounts.put(word, sharedCounts.get(word) + 1);
    } else {
      sharedCounts.put(word, 1);
    }
  }
}

Map<String, Integer> sorted = sharedCounts.entrySet().stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)); 
Michal
  • It looks ok. Map's `put` and `containsKey` have `O(1)` time complexity and otherwise you iterate over all sets over all words exactly once, giving `O(n)` where `n` is a total number of words in all sets. The slowest part is actually sorting, as this would be `O(nlogn)`. – Jaroslaw Pawlak Nov 07 '16 at 16:43
  • 1
    By the way, I have just realised something - how is your `sharedCounts` map different from `sorted` map? They are both hash maps which means unordered. The whole sorting is redundant. Don't you want to collect to list? Or maybe to tree map, but making count a key and word a value? – Jaroslaw Pawlak Nov 07 '16 at 16:47
  • Yep, you have to generate a tree map. I just tried, and it sorts alphabetically, no matter which word has a greater count of occurrences. – Igor Deruga Nov 07 '16 at 16:49
  • why do you sort before putting in a map? – weston Nov 07 '16 at 16:58
  • What is the difference between the variables `sharedCounts` and `commons`? Are those supposed to be the same object? – Enwired Nov 07 '16 at 18:47
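
A minimal sketch of the list-based collection suggested in the comments above (everything except the new name `sortedEntries` follows the question's code); collecting into a `List` keeps the descending order, whereas `Collectors.toMap` gives no ordering guarantee:

List<Map.Entry<String, Integer>> sortedEntries = sharedCounts.entrySet().stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toList());

A `LinkedHashMap` target is another option; see the sketch under the first answer.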

2 Answers


Your algorithm for computing the count has asymptotic complexity of O(SUM(Ni)), where Ni is the size of the i-th set of words. This is as fast as it gets.

It appears that you are missing a filtering step, where you discard words with a count of one.

Map<String, Integer> sorted = sharedCounts.entrySet().stream()
    .filter(e -> e.getValue() > 1) // <<== Add this line
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)); 

Overall complexity is O(SUM(Ni)+Ndistinct-dup*log(Ndistinct-dup)), where Ndistinct-dup is the number of distinct words that have duplicates.
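
A related caveat from the comments under the question: the two-argument Collectors.toMap makes no guarantee about the iteration order of the returned map, so the sorted order is lost on collection. A sketch that keeps it, assuming a LinkedHashMap result is acceptable:

Map<String, Integer> sorted = sharedCounts.entrySet().stream()
    .filter(e -> e.getValue() > 1)
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .collect(Collectors.toMap(
        Map.Entry::getKey,
        Map.Entry::getValue,
        (a, b) -> a,              // merge function; never invoked here, since the keys are already unique
        LinkedHashMap::new));     // preserves the encounter (i.e. sorted) order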

Sergey Kalinichenko
  • That's a nice practical improvement, however theoretical time complexity still stays as `O(nlogn)`. I am not sure whether we can make it any better than this. – Jaroslaw Pawlak Nov 07 '16 at 16:45
  • @JaroslawPawlak I added filtering to make the algorithm return what OP wants (i.e. items that are shared by at least two sets), not to improve the running time, which is, of course, O(N*logN) worst case. – Sergey Kalinichenko Nov 07 '16 at 16:49
  • If the order is `O(SUM(Ni)+Ndistinct-dup*log(Ndistinct-dup))` then we still don't know which part is slower, `O(SUM(Ni))` or `O(Ndistinct-dup*log(Ndistinct-dup))`. If `Sum(Ni)` is large and `Ndistinct` is small, then it would be useful to work on speeding up the counting section. There are ways to do that [here](http://stackoverflow.com/questions/81346/) – Enwired Nov 07 '16 at 19:02
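
Regarding the last comment about speeding up the counting section: the asymptotics cannot be improved, but Java 8's Map.merge replaces the separate containsKey/get/put calls with a single map operation. A sketch, reusing the question's getWords(i) helper:

Map<String, Integer> sharedCounts = new HashMap<>();

for (int i = 0; i < 10; i++) {
  for (String word : getWords(i)) {
    // start at 1 on first encounter, otherwise add 1 to the existing count
    sharedCounts.merge(word, 1, Integer::sum);
  }
}

This stays O(SUM(Ni)); it only trims the constant work done per word.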

Instead of iterating over all elements in each Set, you can simply combine all sets using oneSet.addAll(anotherSet) and then fill your sharedCounts map:

Set<String> combinedSet = new HashSet<>();

for (int i = 0; i < 10; i++) {
  Set<String> words = getWords(i);
  combinedSet.addAll(words);
}

Map<String, Integer> sharedCounts = new HashMap<>();

// For every distinct word, check each of the 10 sets again and count the hits.
for (String word : combinedSet) {
  for (int i = 0; i < 10; i++) {
    Set<String> words = getWords(i);
    if (words.contains(word)) {
      if (sharedCounts.containsKey(word)) {
        sharedCounts.put(word, sharedCounts.get(word) + 1);
      } else {
        sharedCounts.put(word, 1);
      }
    }
  }
}
MrBo
  • And how are you going to create `sharedCounts` map having a single set with all words without duplicates? – Jaroslaw Pawlak Nov 07 '16 at 16:52
  • @JaroslawPawlak look above, I meant something like that – MrBo Nov 07 '16 at 17:31
  • So now think about the time complexity of a hash set's `contains` and how many times you call it. The code you proposed is not any shorter or more readable than OP's code, and it also has far worse performance. – Jaroslaw Pawlak Nov 07 '16 at 17:46