Remove all instances of an item in a list if it appears more than once

Question

Given a List of numbers: { 4, 5, 7, 3, 5, 4, 2, 4 }

The desired output would be: { 7, 3, 2 }

The solution I am thinking of is create below HashMap from the given List:

Map<Integer, Integer> numbersCountMap = new HashMap();

where key is the value of from the list and value is the occurrences count.

Then loop through the HashMap entry set and where ever the number contains count greater than one remove that number from the list.

for (Map.Entry<Int, Int> numberCountEntry : numbersCountMap.entrySet()) {
     if(numberCountEntry.getValue() > 1) {  
        testList.remove(numberCountEntry.getKey());
     }
}

I am not sure whether this is an efficient solution to this problem, as the remove(Integer) operation on a list can be expensive. Also I am creating additional Map data structure. And looping twice, once on the original list to create the Map and then on the map to remove duplicates.

Could you please suggest a better way. May be Java 8 has better way of implementing this. Also can we do it in few lines using Streams and other new structures in Java 8?

Do not open multiple questions, given that you edited and clarified the distinction in the first I have reopened that. Close either https://stackoverflow.com/questions/62676162/efficient-way-of-remove-any-element-which-repeats-more-than-once-including-the — Naman, Jul 01 '20 at 14:09

Youcef LAIDANI · Answer 1 · 2020-07-01T15:04:15.250

5

By streams you can use:

Map<Integer, Long> grouping = integers.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
grouping.values().removeIf(c -> c > 1);
Set<Integer> result = grouping.keySet();

Or as @Holger mentioned, all you want to know, is whether there is more than one integer in your list, so just do:

Map<Integer, Boolean> grouping = integers.stream()
        .collect(Collectors.toMap(Function.identity(),
                x -> false, (a, b) -> true,
                HashMap::new));
grouping.values().removeIf(b -> b);
// or
grouping.values().removeAll(Collections.singleton(true));
Set<Integer> result = grouping.keySet();

edited Jul 01 '20 at 15:04

answered Jul 01 '20 at 14:10

Youcef LAIDANI

55,661
15
90
140

1

What is `integers.stream()` supposed to be? – Clijsters Jul 01 '20 at 14:11
@Clijsters `List` – Youcef LAIDANI Jul 01 '20 at 14:11
Maybe my question wasn't that clear. It's obvious that it should be a List of Integer, but I can't see it being declared in your code or the OPs code. – Clijsters Jul 01 '20 at 14:13
1

@Clijsters the full code look like [this](https://ideone.com/VlJTI1) and the OP said *Given a List of numbers* for me it is clear – Youcef LAIDANI Jul 01 '20 at 14:15
5

Mind that these collectors do not guaranty that the returned map is mutable, so you should use the variant with an explicit map factory instead. Further, you don’t need to count the occurrences, all you want to know, is whether there is more than one: `Map grouping = integers.stream() .collect(Collectors.toMap(Function.identity(), x -> false, (a,b) -> true, HashMap::new));`, followed by either, `grouping.values().removeIf(b -> b);` or `grouping.values().removeAll(Collections.singleton(true));`. Performing `indexOf` in a loop over all elements yields O(n²), so don’t do it. – Holger Jul 01 '20 at 14:53
@Holger very nice solution, it you don't mind I will change my answer with your solution – Youcef LAIDANI Jul 01 '20 at 15:01
@YCF_L You're mixing functional and procedural approaches here. This may lead to future maintenance issues. – ETO Jul 02 '20 at 14:20

score 3 · Answer 2 · answered Jul 02 '20 at 14:18

While YCF_L's answer does the thing and yields the correct result, I don't think it's a good solution to go with, since it mixes functional and procedural approaches by mutating the intermediary collection.

A functional approach would assume either of the following solutions:

Using intermediary variable:

Map<Integer, Boolean> map =
    integers.stream()
            .collect(toMap(identity(), x -> true, (a, b) -> false));

List<Integer> result = map.entrySet()
                          .stream()
                          .filter(Entry::getValue)
                          .map(Entry::getKey)
                          .collect(toList());

Note that we don't even care about the mutability of the map variable. Thus we can omit the 4th parameter of toMap collector.

Chaining two pipelines (similar to Alex Rudenko's answer):

List<Integer> result =
    integers.stream()
            .collect(toMap(identity(), x -> true, (a, b) -> false))
            .entrySet()
            .stream()
            .filter(Entry::getValue)
            .map(Entry::getKey)
            .collect(toList());

This code is still safe, but less readable. Chaining two or more pipelines is discouraged.

Pure functional approach:

List<Integer> result =
    integers.stream()
            .collect(collectingAndThen(
                        toMap(identity(), x -> true, (a, b) -> false),
                        map -> map.entrySet()
                                  .stream()
                                  .filter(Entry::getValue)
                                  .map(Entry::getKey)
                                  .collect(toList())));

The intermediary state (the grouped map) does not get exposed to the outside world. So we may be sure nobody will modify it while we're processing the result.

Nice answer, my answer was quick but there are many ways in programming — Youcef LAIDANI, Jul 02 '20 at 14:26

Nowhere Man · Answer 3 · 2020-07-01T14:39:42.363

You may count the frequency of each number into LinkedHashMap to keep insertion order if it's relevant, then filter out the single numbers from the entrySet() and keep the keys.

List<Integer> data = Arrays.asList(4, 5, 7, 3, 5, 4, 2, 4);
List<Integer> singles = data.stream()
    .collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()))
    .entrySet().stream()
    .filter(e -> e.getValue() == 1)
    .map(Map.Entry::getKey)
    .collect(Collectors.toList());

System.out.println(singles);

Printed output:

[7, 3, 2]

score 1 · Answer 4 · answered Jul 01 '20 at 14:18

It's overengineered for just this problem. Also, your code is faulty:

It's Integer, not Int (minor niggle)
More importantly, a remove call removes the first matching element, and to make matters considerably worse, remove on lists is overloaded: There's remove(int) which removes an element by index, and remove(Object) which removes an element by looking it up. In a List<Integer>, it is very difficult to know which one you're calling. You want the 'remove by lookup' one.

On complexity:

On modern CPUs, it's not that simple. The CPU works on 'pages' of memory, and because fetching a new page takes on the order of 500 cycles or more, it makes more sense to simplify matters and consider any operation that does NOT require a new page of memory to be loaded, to be instantaneous.

That means that if we're talking about a list of, say, 10,000 numbers or fewer? None of it matters. It'll fly by. Any debate about 'efficiency' is meaningless until we get to larger counts.

Assuming that 'efficiency' is still relevant:

integers don't have hashcode collisions.
hashmaps with few to no key hash collisions are effectively O(1) on all single element ops such as 'add' and 'get'.
arraylist's .remove(Object) method is O(n). It takes longer the larger the list is. In fact, it is doubly O(n): it takes O(n) time to find the element you want to remove, and then O(n) time again to actually remove it. For fundamental informatics twice O(n) is still O(n) but pragmatically speaking, arrayList.remove(item) is pricey.
You're calling .remove about O(n) times, making the total complexity O(n^2). Not great, and not the most efficient algorithm. Practically or fundamentally.

An efficient strategy is probably to just commit to copying. A copy operation is O(n). For the whole thing, instead of O(n) per item. Sorting is O(n log n). This gives us a trivial algorithm:

Sort the input. Note that you can do this with an int[] too; until java 16 is out and you can use primitives in collections, int[] is an order of magnitude more efficient than a List<Integer>.
loop through the sorted input. Don't immediately copy, but use an intermediate: For the 0th item in the list, remember only 'the last item was FOO' and 'how many times did I see foo?'. Then, for any item, check if it is the same as the previous. If yes, increment count. If not, check the count: if it was 1, add it to the output, otherwise don't. In any case, update the 'last seen value' to the new value and set the count to 1. At the end, make sure to add the last remembered value if the count is 1, and make sure your code works even for empty lists.

That's O(n log n) + O(n) complexity, which is O(n log n) - a lot better than your O(n^2) take.

Use int[], and add another step that you first go through juuust to count how large the output would be (because arrays cannot grow/shrink), and now you have a time complexity of O(n log n) + 2*O(n) which is still O(n log n), and the lowest possible memory complexity, as sort is in-place and doesn't cost any extra.

If you really want to tweak it, you can go with a space complexity of 0 (you can write the reduced list inside the input).

One problem with this strategy is that you mess up the ordering in the input. The algorithm would produce 2, 3, 7. If you want to preserve order, you can combine the hashmap solution with the sort, and make a copy as you loop solution.

A space complexity of zero would imply that you use your own sort implementation, as the builtin sort methods do not guaranty a zero space complexity. And in-place sort algorithms usually come at a price regarding time complexity. — Holger, Jul 01 '20 at 15:00
Pretty sure the built in sort algorithm does: It just swaps elements around; it declares zero new objects. — rzwitserloot, Jul 01 '20 at 15:11
Which algorithm are you referring to? Since you are assuming O(n log n), you can’t refer to Quicksort, which has a worst case of O(n²). The OpenJDK implementation of `Arrays.sort(…)` mitigates this by using Mergesort in certain cases, but that’s precisely the reason why it does not have zero space complexity. — Holger, Jul 01 '20 at 15:19
*"integers don't have hashcode collisions."* This is true but misleading. The `.hashCode` method will return the integer's value itself, but for the purpose of a hashtable, a collision occurs when the index in the underlying array (which is something like the hashCode modulo the array's length) is the same as the index for a different element. Since the underlying array's size isn't 2^32, collisions should be expected. — kaya3, Jul 01 '20 at 15:39
@kaya3 within the scope of this answer, get and put operations on `Map` are going to be O(1) and no construction of keys is going to mess with this. Contrast to something like a `Map` where constructing strings that all have the same hashcode turns those operations into O(n). — rzwitserloot, Jul 02 '20 at 00:05
That's not true; if all the integers in the input are positive multiples of the hashtable's underlying array length, then they will all collide. — kaya3, Jul 02 '20 at 15:19

score 0 · Answer 5 · answered Jul 05 '20 at 03:49

You can use 3-argument reduce method and walk through the stream only once, maintaining two sets of selected and rejected values.

final var nums = Stream.of(4, 5, 7, 3, 5, 4, 2, 4);
final var init = new Tuple<Set<Integer>>(new LinkedHashSet<Integer>(), new LinkedHashSet<Integer>());
final var comb = (BinaryOperator<Tuple<Set<Integer>>>) (a, b) -> a;
final var accum = (BiFunction<Tuple<Set<Integer>>, Integer, Tuple<Set<Integer>>>) (t, elem) -> {
   if (t.fst().contains(elem)) {
      t.snd().add(elem);
      t.fst().remove(elem);
   } else if (!t.snd().contains(elem)) {
      t.fst().add(elem);
   }
   return t;
};
Assertions.assertEquals(nums.reduce(init, accum, comb).fst(), Set.of(7, 3, 2));

In this example, Tuple were defined as record Tuple<T> (T fst, T snd) { }

WJS · Answer 6 · 2020-07-02T15:04:01.487

-1

Decided against the sublist method due to poor performance on large data sets. The following alternative is faster, and holds its own against stream solutions. Probably because Set access to an element is in constant time. The downside is that it requires extra data structures. Give an ArrayList list of elements, this seems to work quite well.

Set<Integer> dups = new HashSet<>(list.size());
Set<Integer> result = new HashSet<>(list.size());
            
for (int i : list) {
    if (dups.add(i)) {
        result.add(i);
        continue;
    }
    result.remove(i);
}

edited Jul 02 '20 at 15:04

answered Jul 01 '20 at 15:00

WJS

36,363
4
24
39

Updated answer. – WJS Jul 01 '20 at 22:36

Remove all instances of an item in a list if it appears more than once

6 Answers6