
I was reviewing data structures. In the textbook, there is a paragraph saying

"The direct approach is to sort it first, then uniquify it just as a sorted vector. Both steps can be finished within O(nlogn)."

then

"However there is a side-effect: the relative order of the items cannot be preserved."

finally

"Actually we can eliminate this side-effect without increasing the time complexity. Please implement it yourself."

I cannot find any clue. I'm not using the STL while reviewing the textbook; all I can think of is altering something during mergesort.

Does anyone have a better idea?

Sample case

Input: 5, 3, 5, 8, 5, 8, 8, 8, 13

Output: 5, 3, 8, 13

Steven Liang
  • You can create an array of indices `[0 1 2 ..i.. n-1]` and sort it with respect to the values of `A[i]`. Then you can use the sorted array of indices to detect the duplicates, without having modified the order of the original `A[]` array – Damien Oct 15 '20 at 11:59
  • [Already asked here](https://stackoverflow.com/questions/64346313/delete-duplicates-from-array-in-c/64346650#comment113801370_64346650) – PaulMcKenzie Oct 15 '20 at 12:52

1 Answer


A naive approach:

  • Build a copy of your std::vector; in the copy, instead of simple elements, store pairs (element, index) [complexity: O(n)];
  • Sort the copy [complexity: O(n log n)];
  • Move duplicates out of the copy to a new data structure [complexity: O(n)];
  • For every pair (element, index) in the duplicate data structure, remove the element at that index in the original std::vector [complexity: O(n); see for instance the question Erasing multiple objects from a std::vector].

Total complexity: O(n + n log n + n + n) = O(n log n). Sorting is the longest step.
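
Here is a minimal sketch of the four steps above (the names are mine, and it uses `int` elements and `std::sort` for brevity; since the question avoids the STL, any O(n log n) sort such as a hand-rolled mergesort can be substituted). Steps 3 and 4 are merged into a mark-then-rebuild pass, which keeps the same bounds:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Stable deduplication in O(n log n) via an auxiliary vector of
// (element, original index) pairs.
std::vector<int> stable_unique(const std::vector<int>& v) {
    // Step 1: copy into (element, index) pairs -- O(n).
    std::vector<std::pair<int, std::size_t>> pairs;
    pairs.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        pairs.emplace_back(v[i], i);

    // Step 2: sort by element value (original index breaks ties) -- O(n log n).
    std::sort(pairs.begin(), pairs.end());

    // Step 3: mark the original indices of later occurrences -- O(n).
    std::vector<bool> is_dup(v.size(), false);
    for (std::size_t k = 1; k < pairs.size(); ++k)
        if (pairs[k].first == pairs[k - 1].first)
            is_dup[pairs[k].second] = true;

    // Step 4: one pass over the original vector, keeping unmarked
    // elements in their original relative order -- O(n).
    std::vector<int> result;
    for (std::size_t i = 0; i < v.size(); ++i)
        if (!is_dup[i])
            result.push_back(v[i]);
    return result;
}

int main() {
    std::vector<int> v{5, 3, 5, 8, 5, 8, 8, 8, 13};
    for (int x : stable_unique(v)) std::cout << x << ' ';  // prints: 5 3 8 13
    std::cout << '\n';
}
```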

Alternative approaches in Python with benchmarks: https://www.peterbe.com/plog/uniqifiers-benchmark (note: some of these approaches differ from the sort-then-dedup approach; for instance, some use Python's set, which supports membership tests in O(1)).

See also the question [Erasing multiple objects from a std::vector](https://stackoverflow.com/questions/3487717/erasing-multiple-objects-from-a-stdvector).

Stef
  • Nice try bro, except the time complexity issue I guess – Steven Liang Oct 15 '20 at 12:06
  • @StevenLiang Your guess is pessimistic; I edited with more information about the time complexity. – Stef Oct 15 '20 at 12:11
  • The referenced unsorted and sorted data structures are both vectors in my book. The textbook is about data structures in C++. When performing the removal step, the worst-case complexity can be O(n^2). If a linked list is used, it's rather easy indeed, in O(n). – Steven Liang Oct 15 '20 at 12:36
  • No, removing duplicates in a sorted `std::vector` is O(n) if you do it carefully. – Stef Oct 15 '20 at 12:37
  • @StevenLiang In fact, that's why we sort it first. Removing duplicates naively from an unsorted vector would be n^2. – Stef Oct 15 '20 at 12:45
  • I edited the question body to improve readability. The textbook means keeping the relative order of the unsorted, un-deduplicated vector, IMO. – Steven Liang Oct 15 '20 at 12:45
  • @StevenLiang Oh, are you talking about the fourth step? "For every pair (element, index) in the duplicate data structure, remove the element at that index in the original std::vector"? That's also O(n) if you do it carefully. – Stef Oct 15 '20 at 12:47
  • Yes, the last step. Removing an item from the original vector takes O(n), and that gets multiplied by O(n) because the number of duplicates depends on the input and can be as large as n. – Steven Liang Oct 15 '20 at 12:54
  • @StevenLiang: Yes. Your argument shows that there exists an algorithm to perform the last step in worst-case Θ(n^2). However, there also exists a somewhat-less-naive algorithm which is worst-case Θ(n). – Stef Oct 15 '20 at 12:58
  • @StevenLiang See for instance the question [Erasing multiple objects from a std::vector](https://stackoverflow.com/questions/3487717/erasing-multiple-objects-from-a-stdvector) – Stef Oct 15 '20 at 13:01
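
To illustrate the "O(n) if you do it carefully" point from the comments above, here is a minimal sketch (the helper name `erase_marked` is my own, not from the linked question): mark the positions to drop, then compact the vector with a single write cursor and resize once at the end.

```cpp
#include <cstddef>
#include <vector>

// Erase every position flagged in `to_erase` from `v` in one O(n) pass,
// preserving the relative order of the surviving elements.
void erase_marked(std::vector<int>& v, const std::vector<bool>& to_erase) {
    std::size_t write = 0;                  // next slot to keep an element in
    for (std::size_t read = 0; read < v.size(); ++read)
        if (!to_erase[read])
            v[write++] = v[read];           // shift survivors left, in order
    v.resize(write);                        // drop the tail once, at the end
}
```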