
I was reviewing data structures. In the textbook, there is a paragraph saying

"The direct approach is to sort it first, then uniquify it just as a sorted vector. Both steps can be finished within O(nlogn)."

then

"However there is a side-effect: the relative order of the items cannot be preserved."

finally

"Actually we can eliminate this side-effect without increasing the time complexity. Please implement it yourself."

I cannot find any clue. I'm not using the STL while reviewing the textbook; all I can think of is altering something during mergesort.

Does anyone have a better idea?

Sample case

Input: 5, 3, 5, 8, 5, 8, 8, 8, 13

Output: 5, 3, 8, 13

Steven Liang
  • You can create an array of indices `[0 1 2 ..i.. n-1]` and sort it with respect to the values of `A[i]`. Then you can use the sorted array of indices to detect the duplicates, without having modified the order of the original `A[]` array – Damien Oct 15 '20 at 11:59
  • [Already asked here](https://stackoverflow.com/questions/64346313/delete-duplicates-from-array-in-c/64346650#comment113801370_64346650) – PaulMcKenzie Oct 15 '20 at 12:52

1 Answer


A naive approach:

  • Build a copy of your std::vector; in the copy, instead of simple elements, store pairs (element, index) [complexity: O(n)];
  • Sort the copy [complexity: O(n log n)];
  • Move duplicates out of the copy to a new data structure [complexity: O(n)];
  • For every pair (element, index) in the duplicate data structure, remove the element at that index in the original std::vector [complexity: O(n); see for instance the question Erasing multiple objects from a std::vector].

Total complexity: O(n + n log n + n + n) = O(n log n). Sorting is the longest step.
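
Here is a minimal sketch of the four steps above (the names are mine, and it uses `int` elements and `std::sort` for brevity; since the question avoids the STL, any O(n log n) sort such as a hand-rolled mergesort can be substituted). Steps 3 and 4 are merged into a mark-then-rebuild pass, which keeps the same bounds:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Stable deduplication in O(n log n) via an auxiliary vector of
// (element, original index) pairs.
std::vector<int> stable_unique(const std::vector<int>& v) {
    // Step 1: copy into (element, index) pairs -- O(n).
    std::vector<std::pair<int, std::size_t>> pairs;
    pairs.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        pairs.emplace_back(v[i], i);

    // Step 2: sort by element value (original index breaks ties) -- O(n log n).
    std::sort(pairs.begin(), pairs.end());

    // Step 3: mark the original indices of later occurrences -- O(n).
    std::vector<bool> is_dup(v.size(), false);
    for (std::size_t k = 1; k < pairs.size(); ++k)
        if (pairs[k].first == pairs[k - 1].first)
            is_dup[pairs[k].second] = true;

    // Step 4: one pass over the original vector, keeping unmarked
    // elements in their original relative order -- O(n).
    std::vector<int> result;
    for (std::size_t i = 0; i < v.size(); ++i)
        if (!is_dup[i])
            result.push_back(v[i]);
    return result;
}

int main() {
    std::vector<int> v{5, 3, 5, 8, 5, 8, 8, 8, 13};
    for (int x : stable_unique(v)) std::cout << x << ' ';  // prints: 5 3 8 13
    std::cout << '\n';
}
```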

Alternative approaches in Python with benchmarks: https://www.peterbe.com/plog/uniqifiers-benchmark (note: some of these approaches differ from the sort-then-dedup approach; for instance, some use Python's set, which supports membership tests in O(1)).

See also the question [Erasing multiple objects from a std::vector](https://stackoverflow.com/questions/3487717/erasing-multiple-objects-from-a-stdvector).

Stef
  • Nice try bro, except the time complexity issue I guess – Steven Liang Oct 15 '20 at 12:06
  • @StevenLiang Your guess is pessimistic; I edited with more information about the time complexity. – Stef Oct 15 '20 at 12:11
  • The referenced unsorted and sorted data structures are both vectors in my book. The textbook is about data structures in C++. When performing the removal step, the worst-case complexity can be O(n^2). If a linked list is used, it's rather easy indeed, in O(n). – Steven Liang Oct 15 '20 at 12:36
  • No, removing duplicates in a sorted `std::vector` is O(n) if you do it carefully. – Stef Oct 15 '20 at 12:37
  • @StevenLiang In fact, that's why we sort it first. Removing duplicates naively from an unsorted vector would be n^2. – Stef Oct 15 '20 at 12:45
  • I edited the question body to improve readability. The textbook means keeping the relative order of the unsorted, un-deduplicated vector, IMO. – Steven Liang Oct 15 '20 at 12:45
  • @StevenLiang Oh, are you talking about the fourth step? "For every pair (element, index) in the duplicate data structure, remove the element at that index in the original std::vector"? That's also O(n) if you do it carefully. – Stef Oct 15 '20 at 12:47
  • Yes, the last step. Removing an item from the original vector takes O(n), and that gets multiplied by O(n) because the number of duplicates depends on the input and can be as large as n. – Steven Liang Oct 15 '20 at 12:54
  • @StevenLiang: Yes. Your argument shows that there exists an algorithm to perform the last step in worst-case Θ(n^2). However, there also exists a somewhat-less-naive algorithm which is worst-case Θ(n). – Stef Oct 15 '20 at 12:58
  • @StevenLiang See for instance the question [Erasing multiple objects from a std::vector](https://stackoverflow.com/questions/3487717/erasing-multiple-objects-from-a-stdvector) – Stef Oct 15 '20 at 13:01
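
To illustrate the "O(n) if you do it carefully" point from the comments above, here is a minimal sketch (the helper name `erase_marked` is my own, not from the linked question): mark the positions to drop, then compact the vector with a single write cursor and resize once at the end.

```cpp
#include <cstddef>
#include <vector>

// Erase every position flagged in `to_erase` from `v` in one O(n) pass,
// preserving the relative order of the surviving elements.
void erase_marked(std::vector<int>& v, const std::vector<bool>& to_erase) {
    std::size_t write = 0;                  // next slot to keep an element in
    for (std::size_t read = 0; read < v.size(); ++read)
        if (!to_erase[read])
            v[write++] = v[read];           // shift survivors left, in order
    v.resize(write);                        // drop the tail once, at the end
}
```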