21

If I have:

List<Integer> listInts = { 1, 1, 3, 77, 2, 19, 77, 123, 14, 123... }

in Java what is an efficient way of creating a List<Integer> listDistinctInts containing only the distinct values from listInts?

My immediate thought is to create a Set<Integer> setInts containing all the values from listInts then call List<Integer> listDistinctInts = new ArrayList<>(setInts);

But this seems potentially inefficient - is there a better solution using Java 7?

I'm not using Java 8, but I believe using it I could do something like this(?):

List<Integer> listDistinctInts = listInts.stream().distinct().collect(Collectors.toList());

Would this be more performant than the approach above and/or is there any more efficient way of doing this in Java 8?

Finally, (and I'm aware that asking multiple questions might be frowned upon but it's directly related) if I only cared about the count of distinct elements in listInts is there a more efficient way to get that value (in Java 7 and 8) - without first creating a list or set of all the distinct elements?

I'm most interested in native Java ways of accomplishing this and avoiding re-inventing any wheels but would consider hand-rolled code or libraries if they offer better clarity or performance. I've read this related question Java - Distinct List of Objects but it's not entirely clear about the differences in performance between the Java 7 and 8 approaches or whether there might be better techniques?

Community
  • 1
  • 1
Matt Coubrough
  • 3,739
  • 2
  • 26
  • 40
  • Possibly more for [codereview](http://codereview.stackexchange.com)? – hd1 Dec 13 '14 at 23:37
  • Possibly, but it's a specific question on ways to accomplish things in Java... I haven't yet written the code, and often one's first idea is not the best way of accomplishing something! – Matt Coubrough Dec 13 '14 at 23:39
  • I have run benchmarks for fun before and using a HashSet is generally the most efficient except in unusual cases. – Radiodef Dec 13 '14 at 23:45
  • @Radiodef Even faster than the `Java8 stream().distinct()` approach? – Matt Coubrough Dec 13 '14 at 23:46
  • I ran these benchmarks before Java 8. It could be faster. – Radiodef Dec 13 '14 at 23:46
  • The only way I was able to beat the HashSet was to roll my own table using arrays to avoid boxing. I'd expect `distinct` to use HashSet as well. I poked around in the source code and this seems to be true. – Radiodef Dec 14 '14 at 01:32
  • http://stackoverflow.com/questions/12242083/remove-duplicates-from-list-using-guava/12248480#12248480 ? – Louis Wasserman Dec 14 '14 at 02:10
  • @LouisWasserman The `immutableSet` approach looks promising... Does that actually return a new `List` that can be added to and removed from later, or does it return an immutable list? – Matt Coubrough Dec 14 '14 at 22:36
  • It returns an immutable list, but thats usually preferable for most applications. – Louis Wasserman Dec 14 '14 at 22:37
  • The generic class `Stream` doesn’t know if its element type is `Integer` and hence can’t use optimized data structures. It would be different when using `listInts.stream().mapToInt(Integer::intValue).distinct().boxed().collect(Collectors.toList())`. In other word, using `IntStream.distinct` rather than `Stream.distinct`. But I don’t know whether it makes a difference for the current implementation… – Holger Dec 14 '14 at 23:37
  • @Holger Ok... I'm hoping to install Java 8 when I have time and do some profiling of the different methods suggested here if nobody knows the answer, so thanks for the info! – Matt Coubrough Dec 14 '14 at 23:41
  • I just thought about it and I think, for Java 8 it will not make much differences given how hashing works. The Java 7 hash calculations are more complex to avoid collisions but for Java 8 the dealing with collisions has been improved and the hashing much simplified. It actually is almost like having a directly addressed linear table (when the numbers aren’t sparse). After all, `Integer.hashCode` just returns the integer value… – Holger Dec 15 '14 at 00:06

7 Answers7

41

I've now MicroBenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance related questions the answer as to which is best is "it depends".

All my testing was performed with JMH Java Microbenchmarking Harness.

Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:


1. Java 8 Stream - The solution using stream() I had proprosed as a possibility if using Java8:

public List<Integer> testJava8Stream(List<Integer> listInts) {
    return listInts.stream().distinct().collect(Collectors.toList());
}

pros modern Java 8 approach, no 3rd party dependencies

cons Requires Java 8


2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:

public List<Integer> testAddingToList(List<Integer> listInts) {
    List<Integer> listDistinctInts = new ArrayList<>(listInts.size());
    for(Integer i : listInts)
    {
        if( !listDistinctInts.contains(i) ) { listDistinctInts.add(i); }
    }
    return listDistinctInts;
}

pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps

cons Needs to repeatedly check the List for existing values as we build it


3. GS Collections Fast (now Eclipse collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList:

public List<Integer> testGsCollectionsFast(FastList listFast)
{
    return listFast.distinct();
}

pros Reportedly very quick, simple expressive code, works in Java 7 and 8

cons Requires 3rd party library and a FastList rather than a regular List<Integer>


4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList passed to the method rather than a good ol' ArrayList<Integer> so I also tested the adapter method Craig proposed:

public List<Integer> testGsCollectionsAdapted(List<Integer> listInts)
{
    return listAdapter.adapt(listInts).distinct();
}

pros Doesn't require a FastList, works in Java 7 and 8

cons Has to adapt List so may not perform as well, needs 3rd party library


5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:

public List<Integer> testGuavaImmutable(List<Integer> listInts)
{
    return ImmutableSet.copyOf(listInts).asList();
}

pros Reportedly very fast, works in Java 7 or 8

cons Returns an Immutable List, can't handle nulls in the input List, and requires 3rd party library


7. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef)

public List<Integer> testHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new HashSet<Integer>(listInts));
}

pros Works in Java 7 and 8, no 3rd party dependencies

cons Doesn't retain original order of list, has to construct set then copy to list.


6. LinkedHashSet - Since the HashSet solution didn't preserve the order of the Integers in the original list I also tested a version which uses LinkedHashSet to preserve order:

public List<Integer> testLinkedHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts));
}

pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies

cons Unlikely to be as fast as regular HashSet approach


Results

Here are my results for various different sizes of listInts (results ordered from slowest to fastest):

1. taking distinct from ArrayList of 100,000 random ints between 0-50,000 (ie. big list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10        0.505        0.012    ops/s
Java8Stream             thrpt        10      234.932       31.959    ops/s
LinkedHashSet           thrpt        10      262.185       16.679    ops/s
HashSet                 thrpt        10      264.295       24.154    ops/s
GsCollectionsAdapted    thrpt        10      357.998       18.468    ops/s
GsCollectionsFast       thrpt        10      363.443       40.089    ops/s
GuavaImmutable          thrpt        10      469.423       26.056    ops/s

2. taking distinct from ArrayList of 1000 random ints between 0-50 (ie. medium list, many duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10    32794.698     1154.113    ops/s
HashSet                 thrpt        10    61622.073     2752.557    ops/s
LinkedHashSet           thrpt        10    67155.865     1690.119    ops/s
Java8Stream             thrpt        10    87440.902    13517.925    ops/s
GsCollectionsFast       thrpt        10   103490.738    35302.201    ops/s
GsCollectionsAdapted    thrpt        10   143135.973     4733.601    ops/s
GuavaImmutable          thrpt        10   186301.330    13421.850    ops/s

3. taking distinct from ArrayList of 100 random ints between 0-100 (ie. small list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10   278435.085    14229.285    ops/s
Java8Stream             thrpt        10   397664.052    24282.858    ops/s
LinkedHashSet           thrpt        10   462701.618    20098.435    ops/s
GsCollectionsAdapted    thrpt        10   477097.125    15212.580    ops/s
GsCollectionsFast       thrpt        10   511248.923    48155.211    ops/s
HashSet                 thrpt        10   512003.713    25886.696    ops/s
GuavaImmutable          thrpt        10  1082006.560    18716.012    ops/s

4. taking distinct from ArrayList of 10 random ints between 0-50 (ie. tiny list, few duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

Java8Stream             thrpt        10  2739774.758   306124.297    ops/s
LinkedHashSet           thrpt        10  3607479.332   150331.918    ops/s
HashSet                 thrpt        10  4238393.657   185624.358    ops/s
GsCollectionsAdapted    thrpt        10  5919254.755   495444.800    ops/s
GsCollectionsFast       thrpt        10  7916079.963  1708778.450    ops/s
AddingToList            thrpt        10  7931479.667   966331.036    ops/s
GuavaImmutable          thrpt        10  9021621.880   845936.861    ops/s

Conclusions

  • If you're only taking the distinct items from a list once, and the list isn't very long any of these methods should be adequate.

  • The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.

  • You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.

  • The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.

  • The Guava ImmutableSet.copyOf(listInts).asList() method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable and the input list cannot contain nulls.

  • The HashSet method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).

  • The LinkedHashSet approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.

  • Both the HashSet and LinkedHashSet methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foos from a List<Foo>.

  • If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.

  • Disappointingly Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct() call than the way I used, so comments or other answers are of course welcome.

NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.

Community
  • 1
  • 1
Matt Coubrough
  • 3,739
  • 2
  • 26
  • 40
  • 2
    Kudos for testing using JMH. I'd like to make two suggestions for the benchmark inputs. First, use a collection 10x or 100x bigger. 100k integers can easily fit in on-chip cache. Second, use random integers as the input. The interesting property of 100k ints between 0 and 50k is that all hashcode collisions are also duplicates. So the set never contains any collisions. Guava's algorithm is particularly good when there are no collisions. I'd recommend generating random integers and a random number of occurrences (let's say 1 to 4 occurrences) and then shuffling the entire list before using it. – Craig P. Motlin Dec 17 '14 at 04:21
  • 2
    @Craig P. Motlin - I get your point about the hashCode collisions beign dupes, and will try some additional tests when I get the chance, although these tests most closely reflect the distribution of data for my use-case. – Matt Coubrough Dec 17 '14 at 04:47
3

If you're using Eclipse Collections (formerly GS Collections), you can use the method distinct().

ListIterable<Integer> listInts = FastList.newListWith(1, 1, 3, 77, 2, 19, 77, 123, 14, 123);
Assert.assertEquals(
        FastList.newListWith(1, 3, 77, 2, 19, 123, 14),
        listInts.distinct());

The advantage of using distinct() instead of converting to a Set and then back to a List is that distinct() preserves the order of the original List, retaining the first occurrence of each element. It's implemented by using both a Set and a List.

MutableSet<T> seenSoFar = UnifiedSet.newSet();
int size = list.size();
for (int i = 0; i < size; i++)
{
    T item = list.get(i);
    if (seenSoFar.add(item))
    {
        targetCollection.add(item);
    }
}
return targetCollection;

If you cannot convert your original List into a GS Collections type, you can use ListAdapter to get the same API.

MutableList<Integer> distinct = ListAdapter.adapt(integers).distinct();

There's no way to avoid the creation of the Set. Still, UnifiedSet is more efficient than HashSet so there will be some speed benefit.

If all you want is the number of distinct items, it's more efficient to just create a set without creating the list.

Verify.assertSize(7, UnifiedSet.newSet(listInts));

Eclipse Collections 8.0 requires Java 8. Eclipse Collections 7.x works well with Java 8, but only requires Java 5.

Note: I am a committer for Eclipse collections.

Donald Raab
  • 6,458
  • 2
  • 36
  • 44
Craig P. Motlin
  • 26,452
  • 17
  • 99
  • 126
1

You should try new LinkedList(new HashSet(listInts)).

Everv0id
  • 1,862
  • 3
  • 25
  • 47
  • 1
    This is very similar to what I suggested in my question - Is there any reason to prefer a LinkedList rather than an ArrayList? Will this perform better than the Java 8 approach? – Matt Coubrough Dec 13 '14 at 23:54
  • This way takes `O(n)` time which is the best result for this operation. In addition, `LinkedList` has `O(1)` addition-to-tail time in *any* case, unlike `ArrayList` sometimes needs to recreate inner array on addition operation. – Everv0id Dec 14 '14 at 00:39
  • 1
    But surely an ArrayList will reserve the space of the size of the HahsSet passed to the constructor and there will be no add operations at all? – Matt Coubrough Dec 14 '14 at 00:40
  • In this specific case, yes. – Everv0id Dec 14 '14 at 00:44
  • 2
    @Everv0id: `ArrayList` has _amortized_ O(1) add() complexity, which means any sequence of k add operations takes O(k) total time. Additionally, `ArrayList` has so much better constant factors that it's almost never a good idea to use `LinkedList`. – Louis Wasserman Dec 14 '14 at 00:57
  • @LouisWasserman I know that *amortized* complexity for ArrayList is O(1), but dont sure about constants. I think we need to do some benchmarks. Also, there is an opinion that using of `ListIterator` can improve `LinkedList` performance dramatically. – Everv0id Dec 14 '14 at 01:05
  • If the element order of the source list ought to be retained, it should be `LinkedHashSet` rather than `HashSet`. And for constructing the `List` from a `Set`, using `ArrayList` is much more efficient than `LinkedList`. It just calls `toArray()` on the `Set` and uses that array; it can’t be simpler. In contrast, the `LinkedList` has to iterate over all elements and create and link a node object for each. – Holger Dec 14 '14 at 23:48
1

Guava can be your choice:

ImmutableSet<Integer> set = ImmutableSet.copyOf(listInts);

The API is extremely optimized.

It is FASTER than listInts.stream().distinct() and new LinkedHashSet<>(listInts).

Community
  • 1
  • 1
卢声远 Shengyuan Lu
  • 31,208
  • 22
  • 85
  • 130
  • This seems to be the same solution as suggested by Louis Wasserman in the comments and does seem promising from my initial testing... The caveats being that it doesn't handle `null` in `listInts` (not a problem for me) and it returns an *immutable* list (which may or may not be an issue until I think about it some more). – Matt Coubrough Dec 15 '14 at 03:47
  • @MattCoubrough Yes, I ignored Louis Wasserman's comment when I post the answer. – 卢声远 Shengyuan Lu Dec 15 '14 at 04:58
0

When adding a value to a listInts check:

int valueToAdd;
//...
if (!listInts.contains(valueToAdd)) {listInts.add(valueToAdd)}

if you have an existing list, use a for-each statement to copy all the values from that list, to a new one, that you want to be "distinct":

List<Integer> listWithRepeatedValues;
List<Integer> distinctList;
//...
for (Integer i : listWithRepeatedValues) {
    if (!listInts.contains(valueToAdd)) {distinctList.add(i);}
}
Victor2748
  • 4,149
  • 13
  • 52
  • 89
  • 2
    If I'm starting with a List of Integers to begin with, wouldn't this be a horribly inefficient way of creating a distinct list compared to adding them to a set? – Matt Coubrough Dec 13 '14 at 23:41
  • @MattCoubrough I don't think so, when working with lists in Java, that "full-copy" from one list/array to another happens many times – Victor2748 Dec 13 '14 at 23:43
  • @MattCoubrough I wouldn't say that it would be inefficient – Victor2748 Dec 13 '14 at 23:44
  • 2
    List.contains is O(n) whereas HashSet.contains is O(1). Calling an O(n) operation once per `add` instead of an O(1) operation *surely* can't be better than the `new ArrayList<>(set)` approach I suggested in my question can it? Or is the cost of creating the set and then the distinct list really expensive? – Matt Coubrough Dec 13 '14 at 23:52
  • Shouldn't that last line of code read: `if(!distinctList.contains(i)) { distinctList.add(i); }` ? – Matt Coubrough Dec 13 '14 at 23:59
  • @MattCoubrough When you are creating a new Arraylist<>(set), it still copies every element from the set to the new list, so it will not be significantly more performance-expencive than creating a new list and passing the set to the constructor – Victor2748 Dec 13 '14 at 23:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/66830/discussion-between-victor2748-and-matt-coubrough). – Victor2748 Dec 14 '14 at 00:10
0

Don't worry. Using a HashSet is a pretty easy and efficient way to eliminate duplicates:

    Set<Integer> uniqueList = new HashSet<>();
    uniqueList.addAll(listInts);   // Add all elements eliminating duplicates

    for (int n : uniqueList)       // Check the results (in no particular order)
        System.out.println(n);

    System.out.println("Number distinct values: " + uniqueList.size());

In a more specific scenario, just in case the range of possible values is known, is not very large, while listInts is very large.
The most efficient way of counting the number of unique entries in the list that I can think of is:

    boolean[] counterTable = new boolean[124];
    int counter = 0;

    for (int n : listInts)
        if (!counterTable[n]) {
            counter++;
            counterTable[n] = true;
        }

    System.out.println("Number of distinct values: " + counter);
ulix
  • 999
  • 9
  • 13
  • Thanks for the answer, but when referring to the count of distinct elements I specifically meant the total number of unique Integers in the original `listInts` (ie. in this case `uniqueList.size()` ), as opposed to the number of times each individual duplicate integer occurs - sorry if my wording was unclear. – Matt Coubrough Dec 14 '14 at 22:15
0

This should work:

yourlist.stream().map(your wrapper that overrides equals and hashchode method::new).distinct().map(wrapper defined above::method that returns the final output).collect(Collectors.toList());

Righto
  • 855
  • 3
  • 11
  • 32