
I have a stream of objects similar to this previous question; however, instead of ignoring duplicate values, I would like to remove any duplicated values from that stream beforehand and print them out.

For example, from this snippet:

Map<String, String> phoneBook = people.stream()
                                      .collect(toMap(Person::getName,
                                                     Person::getAddress));

If there were duplicate entries, it would cause a java.lang.IllegalStateException: Duplicate key error to be thrown.

The solution proposed in that question used a mergeFunction to keep the first entry if a collision was found.

Map<String, String> phoneBook = 
    people.stream()
          .collect(Collectors.toMap(
             Person::getName,
             Person::getAddress,
             (address1, address2) -> {
                 System.out.println("duplicate key found!");
                 return address1;
             }
          ));

Instead of keeping the first entry, if there is a collision from a duplicate key in the stream, I want to know which value caused the collision and make sure that there are no occurrences of that value within the resulting map.

I.e. if "Bob" appeared three times in the stream, it should not be in the map even once.

In the process of creating that map, I would like to filter out any duplicate names and record them some way.

I want to make sure that when creating the map there can be no duplicate entries, and that there is some way to know which entries had duplicate keys in the incoming stream. I was thinking about using groupingBy and filter beforehand to find the duplicate keys, but I am not sure what the best way to do it is.
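A rough sketch of that two-pass idea (names are illustrative; it walks the input twice, once to find the duplicate keys and once to build the map):

// First pass: count how often each name occurs.
Map<String, Long> countByName = people.stream()
    .collect(Collectors.groupingBy(Person::getName, Collectors.counting()));

// Report and remember the duplicated names.
Set<String> duplicateNames = countByName.entrySet().stream()
    .filter(entry -> entry.getValue() > 1)
    .map(Map.Entry::getKey)
    .collect(Collectors.toSet());
duplicateNames.forEach(name -> System.out.println("duplicate key found: " + name));

// Second pass: build the map, skipping every duplicated name entirely.
Map<String, String> phoneBook = people.stream()
    .filter(person -> !duplicateNames.contains(person.getName()))
    .collect(Collectors.toMap(Person::getName, Person::getAddress));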

Jengels
    To be clear, if you get two entries that result in a collision (duplicate key) you want to know in advance and then remove _both_ entries from the input stream? You didn't say that explicitly, but if not, how is your requirement different from using the merge function in your sample code? – Jim Garrison Apr 28 '22 at 19:07
  • I don't understand this part "when creating the list there can be no duplicate entry". Aren't you creating a map with String keys and String values. Which list are you referring to? The people one? – Dan Apr 28 '22 at 19:09
  • If there is a collision from a duplicate key in the stream, I want to know which value caused the collision and make sure that there are no occurrences of that value within the resulting map. I believe the merge function in the sample code results in a single entry for the first entry of the key-value pair. I think groupingBy and filter are needed, as there may be multiple occurrences of the same key and I don't want any occurrences of that key in the final map – Jengels Apr 28 '22 at 19:14
  • @Dan if people is streaming out a list of people objects and each has a name and address, I am trying to create a corresponding hash map of name to address where each name only ever occurred once in the stream (i.e. if "Bob" appeared three times in the stream, it should not be in the map even once). In the process of creating that map I would like to filter out any duplicate names and record them some way. – Jengels Apr 28 '22 at 19:19
  • 'Keys that would cause collisions' in a hash map is not the same thing as 'duplicate keys'. – user207421 Apr 29 '22 at 00:51

3 Answers


You can't know which keys are duplicates until you have processed the entire input stream. Therefore, any pre-processing step has to make a complete pass of the input before your main logic, which is wasteful.

An alternate approach could be:

  1. Use the merge function to insert a dummy value for the offending key
  2. At the same time, insert the offending key into a Set<K>
  3. After the input stream is processed, iterate over the Set<K> to remove offending keys from the primary map.
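A minimal sketch of this idea, assuming the Person class from the question. Since the merge function never receives the key (see the comment below), the sketch marks collisions with a sentinel value and strips every marked entry afterwards, instead of maintaining a separate Set<K>:

static final String DUPLICATE = "<duplicate>"; // sentinel; assumes no real address equals it

static Map<String, String> phoneBookWithoutDuplicates(Collection<Person> people) {
    Map<String, String> result = people.stream()
        .collect(Collectors.toMap(
            Person::getName,
            Person::getAddress,
            (address1, address2) -> DUPLICATE, // collision: overwrite with the sentinel
            HashMap::new));                    // explicit mutable map, so we can remove below
    result.values().removeIf(DUPLICATE::equals); // drop every key that collided
    return result;
}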
Jim Garrison
    Since the merge function doesn’t receive the key, it can’t add it to a set. But you don’t need to maintain a `Set` anyway. Just perform a `resultMap.values().removeIf(dummyValue::equals)` or `resultMap.values().removeAll(Set.of(dummyValue))` after collecting with `(a,b) -> dummyValue` as merge function (and to be clean, use `HashMap::new` as map provider then, to ensure mutability). – Holger Apr 29 '22 at 07:32

I would like to remove any values from that stream beforehand.

As @JimGarrison has pointed out, preprocessing the data doesn't make sense.

You can't know in advance whether a name is unique or not until the whole data set has been processed.

Another thing you have to consider is that inside the stream pipeline (before the collector) you have no knowledge of what data has been encountered previously, because the results of intermediate operations should not depend on any state.

If you are thinking of streams as acting like a sequence of loops, and are therefore assuming that it's possible to preprocess stream elements before collecting them, that's not correct. Elements of the stream pipeline are processed lazily, one at a time. I.e. all the operations in the pipeline will get applied to a single element, and each operation will be applied only if it's needed (that's what laziness means).
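A tiny illustration of this per-element processing (a demonstration added here, not part of the original answer):

List.of("Alise", "Bob", "Carol").stream()
    .peek(name -> System.out.println("encountered: " + name))
    .map(String::length)
    .forEach(len -> System.out.println("mapped: " + len));
// Prints "encountered"/"mapped" pairs interleaved per element,
// not all "encountered" lines followed by all "mapped" lines.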

For more information, have a look at this tutorial and the API documentation.

Implementations

You can segregate unique values and duplicates in a single stream statement by utilizing Collectors.teeing() and a custom object that will contain separate collections of duplicated and unique entries of the phone book.

Since the primary function of this object is only to carry data, I've implemented it as a Java 16 record.

public record FilteredPhoneBook(Map<String, String> uniquePersonsAddressByName,
                                List<String> duplicatedNames) {}

Collectors.teeing() expects three arguments: two collectors and a function that merges the results produced by both collectors.

The map generated by groupingBy() in conjunction with counting() is meant to determine the duplicated names.

Since the duplicates can't be known in advance, toMap(), which is used as the second collector, will create a map containing all the names.

When both collectors hand their results to the merger function, it takes care of removing the duplicates.

public static FilteredPhoneBook getFilteredPhoneBook(Collection<Person> people) {
    return people.stream()
        .collect(Collectors.teeing(
            Collectors.groupingBy(Person::getName, Collectors.counting()), // intermediate Map<String, Long>
            Collectors.toMap(                                              // intermediate Map<String, String>
                Person::getName,
                Person::getAddress,
                (left, right) -> left),
            (Map<String, Long> countByName, Map<String, String> addressByName) -> {
                countByName.values().removeIf(count -> count == 1);        // removing unique names
                addressByName.keySet().removeAll(countByName.keySet());    // removing all duplicates
                
                return new FilteredPhoneBook(addressByName, new ArrayList<>(countByName.keySet()));
            }
        ));
}

Another way to address this problem is to utilize a Map<String,Boolean> as the means of discovering duplicates, as @Holger has suggested.

The first collector will be written using toMap(). It will associate true with a key that has been encountered only once, and its mergeFunction will assign the value false if at least one duplicate is found.

The rest of the logic remains the same.

public static FilteredPhoneBook getFilteredPhoneBook(Collection<Person> people) {
    return people.stream()
        .collect(Collectors.teeing(
            Collectors.toMap(            // intermediate Map<String, Boolean>
                Person::getName,
                person -> true,          // not proved to be a duplicate and initially considered unique
                (left, right) -> false), // is a duplicate
            Collectors.toMap(            // intermediate Map<String, String>
                Person::getName,
                Person::getAddress,
                (left, right) -> left),
            (Map<String, Boolean> isUniqueByName, Map<String, String> addressByName) -> {
                isUniqueByName.values().removeIf(Boolean::booleanValue);   // removing unique names
                addressByName.keySet().removeAll(isUniqueByName.keySet()); // removing all duplicates

                return new FilteredPhoneBook(addressByName, new ArrayList<>(isUniqueByName.keySet()));
            }
        ));
}

main() - demo

public static void main(String[] args) {
    List<Person> people = List.of(
        new Person("Alise", "address1"),
        new Person("Bob", "address2"),
        new Person("Bob", "address3"),
        new Person("Carol", "address4"),
        new Person("Bob", "address5")
    );

    FilteredPhoneBook filteredPhoneBook = getFilteredPhoneBook(people);

    System.out.println("Unique entries:");
    filteredPhoneBook.uniquePersonsAddressByName().forEach((k, v) -> System.out.println(k + " : " + v));
    System.out.println("\nDuplicates:");
    filteredPhoneBook.duplicatedNames().forEach(System.out::println);
}

Output

Unique entries:
Alise : address1
Carol : address4

Duplicates:
Bob
Alexander Ivanchenko
    Nice. I wasn't aware of `teeing()` yet - learned something. – Jim Garrison Apr 29 '22 at 00:42
    Mind that you don’t need to *count* when you only want to know *whether* there’s more than one. Replace `Collectors.groupingBy(Person::getName, Collectors.counting())` with `Collectors.toMap(Person::getName, person -> false, (n1,n2) -> true)` and `.filter(entry -> entry.getValue() > 1)` with `.filter(Map.Entry::getValue)` and you’re also only keeping duplicates. But I’d prefer Jim Garrison’s placeholder approach instead of maintaining two maps. – Holger Apr 29 '22 at 07:39
  • @Holger Sure, you are right. I've read Jim's answer before posting this solution, and I understood his approach as making use of `Map` and then operating on the key set. And I agree that `Map<String,Boolean>` is more expressive in regard to what its purpose is. I've used `groupingBy()` in conjunction with `counting()` only because it's roughly the same in terms of performance and slightly more concise. But `Map<String,Boolean>` is admittedly more descriptive about what it does. – Alexander Ivanchenko Apr 29 '22 at 14:56
  • @Holger I've updated the answer by refining the `merger` function and added an implementation which makes use of `Map<String,Boolean>`. BTW, I didn't get what you meant when you said that there's no need to maintain two maps. We can't extract the address from the `Map<String,Boolean>` and can't use `Person` as a key because uniqueness is based on the name attribute, not on the equality of person objects. Or am I getting you wrong? – Alexander Ivanchenko Apr 29 '22 at 15:02
    You are creating two maps with the same key, `Map<String,Boolean>` and `Map<String,String>`, the key being the name in either case. You can use the approach described in Jim Garrison’s answer, using a merge function replacing the address with a duplicate marker instead. Or you map to a record containing the address and a boolean flag instead. No-one suggested to use a `Person` as key. I don’t know why you are discussing this. – Holger Apr 29 '22 at 16:15
  • @Holger Thanks, got it. I've misunderstood you both. – Alexander Ivanchenko Apr 29 '22 at 17:47
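For reference, a sketch of the single-map variant described in the last comment: map each name to a small record carrying the address and a duplicate flag (the AddressEntry record and the method name are illustrative, not from the thread):

record AddressEntry(String address, boolean duplicate) {} // hypothetical helper

public static FilteredPhoneBook getFilteredPhoneBookSingleMap(Collection<Person> people) {
    Map<String, AddressEntry> merged = people.stream()
        .collect(Collectors.toMap(
            Person::getName,
            person -> new AddressEntry(person.getAddress(), false),    // first sighting
            (left, right) -> new AddressEntry(left.address(), true))); // collision: flag it

    Map<String, String> uniquePersonsAddressByName = new HashMap<>();
    List<String> duplicatedNames = new ArrayList<>();
    merged.forEach((name, entry) -> {
        if (entry.duplicate()) duplicatedNames.add(name);
        else uniquePersonsAddressByName.put(name, entry.address());
    });
    return new FilteredPhoneBook(uniquePersonsAddressByName, duplicatedNames);
}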

In mathematical terms you want to partition your grouped aggregate and handle both parts separately.

Map<String, String> makePhoneBook(Collection<Person> people) {
    Map<Boolean, List<Person>> phoneBook = people.stream()
            .collect(Collectors.groupingBy(Person::getName))
            .values()
            .stream()
            .collect(Collectors.partitioningBy(list -> list.size() > 1,
                    Collectors.mapping(list -> list.get(0),
                            Collectors.toList())));

    // handle duplicates
    phoneBook.get(true)
            .forEach(x -> System.out.println("duplicate found " + x));

    return phoneBook.get(false).stream()
            .collect(Collectors.toMap(
                    Person::getName,
                    Person::getAddress));
}
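Hypothetical usage, reusing the people list from the demo in the previous answer:

List<Person> people = List.of(
    new Person("Alise", "address1"),
    new Person("Bob", "address2"),
    new Person("Bob", "address3"),
    new Person("Carol", "address4"),
    new Person("Bob", "address5"));

Map<String, String> phoneBook = makePhoneBook(people);
// prints one "duplicate found ..." line for Bob's grouped entry
System.out.println(phoneBook); // {Alise=address1, Carol=address4} (order may vary)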
Valerij Dobler