
Consider this test app:

uses
  Spring.Collections;

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  // How to implement this function?
end;

var
  Enumerable: IEnumerable<Integer>;
  UniqueEnumerable: IEnumerable<Integer>;
begin
  Enumerable := TCollections.CreateList<Integer>([1, 1, 2, 3, 3, 3, 4]);
  UniqueEnumerable := RemoveDuplicates(Enumerable);
  UniqueEnumerable.ForEach(
    procedure(const I: Integer)
    begin
      WriteLn(I);
    end);
  ReadLn;
end.

How can I implement the RemoveDuplicates function (this is called nub in Haskell)?

Jens Mühlenhoff

4 Answers


Use what is already there:

uses
  Spring.Collections,
  Spring.Collections.Extensions;

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  Result := TDistinctIterator<Integer>.Create(Input, nil);
end;

This supports lazy evaluation (meaning that Input is not processed until the resulting enumerable is enumerated). Internally it uses a hash set (currently implemented as a Dictionary) to keep track of the items already seen (this happens inside the enumerator).

Why is that important? Because any operation that performs a complete enumeration can cause an unwanted performance impact if Input involves other costly operations, which may far outweigh any benefit of other approaches to removing duplicates (such as putting the items into a list and sorting it). Also, an IEnumerable is not guaranteed to be finite.

If Input is changed between calling this function and enumerating the result, that change affects the outcome of your enumeration, which would not be the case without lazy evaluation. If you enumerate multiple times, the outcome may be different (i.e. up to date) each time.
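
To illustrate, here is a minimal sketch (assuming the RemoveDuplicates implementation above): an item added to the source after the call, but before the enumeration, still shows up in the result.

var
  List: IList<Integer>;
  Unique: IEnumerable<Integer>;
begin
  List := TCollections.CreateList<Integer>([1, 1, 2]);
  Unique := RemoveDuplicates(List);
  List.Add(3); // modified after the call, but before enumeration
  Unique.ForEach(
    procedure(const I: Integer)
    begin
      WriteLn(I); // prints 1, 2, 3 - the lazily evaluated result sees the change
    end);
end.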

Stefan Glienke

Jens's solution will work, but it has a rather slow running time of O(n²).

A better alternative if you have a long list is to
- Sort the list
- Compare every item with its successor and remove it if they are equal.

This has a running time of O(n log n) for the quicksort plus O(n) for the scan, for a total running time of O(n log n).

See the following pseudo code (don't have access to Delphi now).

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  List: IList<Integer>;
  i: Integer;
begin
  List := TCollections.CreateList<Integer>;
  List.AddRange(Input); // copy the input into a list we can sort in place
  List.Sort;
  // Walk backwards so deleting an item does not shift the part still to visit.
  for i := List.Count - 1 downto 1 do
  begin
    if List[i] = List[i - 1] then
      List.Delete(i);
    //if Comparer<T>.Equals(List[i], List[i-1]) then ....
  end; {for i}
  Result := List;
end;

Problems
The problem with this approach is that the output may have a different order from the input. This may or may not be a problem.

Benefits (or why the dictionary sucks)
If the sorting is a cheap operation, this will be the fastest approach.
The use of a dictionary carries a high constant cost for the hashing.
Even though the hashing operation is O(1), it can get very expensive for large keys, because the hash will always process the whole key, whereas a sorting comparison stops as soon as a difference is detected. Further note that hashing is a much more expensive operation than a simple comparison (about 30x to 100x slower)!

Only when the list is huge does the better asymptotic running time of the dictionary kick in.
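
To get a feel for where that crossover lies, here is a rough micro-benchmark sketch. The names RemoveDuplicatesSorted and RemoveDuplicatesHashed are hypothetical stand-ins for the sort-based code above and the dictionary-based code in the next answer:

uses
  System.Diagnostics, // TStopwatch
  Spring.Collections;

var
  Input: IList<Integer>;
  i: Integer;
  Stopwatch: TStopwatch;
begin
  // Build a large input containing many duplicates.
  Input := TCollections.CreateList<Integer>;
  for i := 1 to 1000000 do
    Input.Add(Random(1000));

  // RemoveDuplicatesSorted/RemoveDuplicatesHashed are hypothetical names
  // for the sort-based and dictionary-based implementations shown here.
  Stopwatch := TStopwatch.StartNew;
  RemoveDuplicatesSorted(Input);
  WriteLn('sort + dedup: ', Stopwatch.ElapsedMilliseconds, ' ms');

  Stopwatch := TStopwatch.StartNew;
  RemoveDuplicatesHashed(Input);
  WriteLn('dictionary:   ', Stopwatch.ElapsedMilliseconds, ' ms');
end.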

Johan
  • What hash function are you using on an integer? Or, in other words, much of what you have written here is plain wrong in this setting. You may have a point where hashing is expensive, but that's not the case here. No matter how expensive it is, for decently large data, well before RAM is full, you can expect hashing to win. – David Heffernan Sep 03 '15 at 15:42
  • @David I was rather assuming that integer is just used as an example here. If you use the default BobJenkins lookup3 hash, it will be quite slow even for an integer. – Johan Sep 03 '15 at 16:20
  • There are certainly many commonly used types which can be hashed efficiently. So "why the dictionary sucks" is a rather overdramatic and perhaps misleading statement. In my view. And isn't the default hash for an integer the integer itself? Surely that is true. Hmm, no it is not true. I wonder why not. – David Heffernan Sep 03 '15 at 16:23
  • No, System.Generics.Defaults pushes everything through the BobJenkins lookup3 hash. Hence my assertion about the slowness. Using lookup3 makes sense if you're only using part of the integer (e.g. 16 bits out of 32); much fewer collisions that way. – Johan Sep 03 '15 at 16:27
  • Even so, I bet that for even modestly sized data the log n soon overtakes the hash cost. – David Heffernan Sep 03 '15 at 16:30
  • This is why I've put faster hashes in fastdefaults. Will do a test to see when the standard and optimized hash wins out over sorting+dedup. – Johan Sep 03 '15 at 16:33
  • FWIW I recently solved a dedup perf issue with hashed lookup. I'm not even sure I compared with sorting. The thing is, usually, when the data is small enough for sorting to be fast, you don't care about the perf. Everything happens instantly. The asymptotic behaviour invariably matters because when the log n bites, it really hurts. – David Heffernan Sep 03 '15 at 16:44

For performance reasons I suggest using a dictionary.

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  Dictionary: IDictionary<integer, integer>;
  Item: integer;
begin
  Dictionary := TCollections.CreateDictionary<integer,integer>;
  for Item in Input do
    Dictionary.AddOrSetValue(Item, 0); // the value is irrelevant, only the keys are used

  Result := Dictionary.Keys;
end;
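
Note that, unlike the TDistinctIterator answer above, this approach is eager: Input is completely enumerated before the function returns, so any later change to Input does not affect the result.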
Wosi

Using an intermediate list:

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  List: IList<Integer>;
begin
  List := TCollections.CreateList<Integer>;
  Input.ForEach(
    procedure(const I: Integer)
    begin
      if not List.Contains(I) then // linear search makes this O(n²) overall
        List.Add(I);
    end);
  Result := List;
end;

This is obviously not the best-performing solution; see the other answers for better alternatives.

Jens Mühlenhoff