How to stop std::set from sorting?

Question

I have an array of names, but I require only the unique ones. I use std::set so that it clears out duplicate. Yet I need the name appear in the same order as input. That means if my input is:

Mary
Mary
John
John
John
Apple
Apple
Apple

[Edit]: After checking comments/answers, I want to emphasis that each name appears in group and does not show up later on in the input. Refer to example, Mary appears two times and that is. It does not show up again later on.[/Edit]

I want my output to be:

Mary
John
Apple

Using std::set, I get the sorted one:

Apple
John
Mary

I find out there is unordered_set (from {cplusplus.com}). This one again does not keep the input order.

Question:

Is there a way to stop the std::set from sorting?
I have read that {one can write own's sorting method for std::set}. Now if I cannot stop the set from sorting, how about writing my own sorting method, but always return the first element of input as the smallest? (If I can get thru the detail on how to do it...)
Or is there other thing in std that can reduce a group of strings into a unique set, but does not sort it?

Thanks!

1. No. 2. wouldn't work. 3. `std::vector`, check for duplicates before inserting new elements. — juanchopanza, Sep 01 '14 at 09:39
Are all of your duplicate elements guaranteed to be consecutive, as in your example input? If so, then use [`std::unique`](http://en.cppreference.com/w/cpp/algorithm/unique) — Benjamin Lindley, Sep 01 '14 at 09:41
You could use [Boost.MultiIndex](http://www.boost.org/doc/libs/1_56_0/libs/multi_index/doc/index.html) with a sequenced index and a unique hashed (or ordered) index. — Angew is no longer proud of SO, Sep 01 '14 at 09:43
The main question is wrong: a std::set sorts and doesn't store duplicates. period. So the question should be: how can I have a container without duplicates while keeping the original order? — stefaanv, Sep 01 '14 at 09:59

score 6 · Answer 1 · answered Sep 01 '14 at 10:10

6

The simplest thing is to keep 2 collections, the vector and the set (or unordered_set). This will consume more memory but will use the set to check for duplicates (in O(log N) time) and the vector to maintain order.

The set can also alternatively contain the position in the vector of the item and have as a predicate v[i] < v[j]. Slightly complicated as you'd need to store a reference/pointer to your vector in the special predicate. However it can be done and will use potentially less memory as you only have one collection of strings and the other is of ints. In addition it acts as an index, being able to quickly locate where a particular item is.

answered Sep 01 '14 at 10:10

CashCow

30,981
5
61
92

+1 / a `vector of `const_iterator`s may be easier than the `set` referencing the `vector`.... – Tony Delroy Sep 01 '14 at 10:28
1

The *simplest* thing is just a vector, with on O(N) search before insertion. Maybe that's fast enough. – Mike Seymour Sep 01 '14 at 10:30
1

Yes for a small collection linear may be fast enough but then if it's small the space constraint is not likely to be an issue either. – CashCow Sep 01 '14 at 10:39
Why use two collections? Why not use `std::vector` and `std::unique` together (just like my answer)? – Just a HK developer Sep 02 '14 at 05:52
The problem of a set of (const) iterators is that iterators get invalidated as you add items to a vector when it has to allocate more memory. Of course you could use deque instead of vector. Either way, your predicate needs to also be able to check the value of the item (as well as position / iterator) unless you plan to push it then pop it straight back again if it's a duplicate. – CashCow Jan 05 '17 at 16:38

score 3 · Accepted Answer · answered Sep 02 '14 at 05:50

After reading all comments and answers, I think the most direct way to answer my own question is to use std::vector and std::unique.

Point to note is:

I have a list of names that is small. Should not be more than 2000 names.
Each name appears in a cluster. If Mary appears 2 times, it will not appear any more in the rest of of the list.
I only need to get a set of unique names, but keep the initial ordering.
After getting that unique set, I do not need to do any more operation (insert/remove/etc) to the set.

So here is my coding:

#include <vector>

int main()
{
    std::vector<std::string> names;
    std::vector<std::string>::iterator last;
    std::vector<std::string>::iterator it;

    names.push_back("Mary");
    names.push_back("Mary");
    names.push_back("John");
    names.push_back("John");
    names.push_back("John");
    names.push_back("Apple");
    names.push_back("Apple");
    names.push_back("Apple");

    last = std::unique(names.begin(), names.end());
    for (it = names.begin(); it != last; ++it)
        std::cout << *it << endl;
}

And so the output will be (what I want):

Mary
John
Apple

That is it. Thanks for those contributing. Feel free to comment, especially about the efficiency.

std::unique will only work if they are grouped, i.e. all the items that are the same are together. And if you "sort" them first - well you know what will happen... not quite what you want. — CashCow, Jan 09 '17 at 15:51

score 1 · Answer 3 · edited Sep 02 '14 at 03:56

You're attempting to change a fundamental design implementation. Instead you should probably rethink your own design and not try to go against the grain of the standard library.

My solution would be to use a std::vector<std::string> and depending on what your program aims to do either:

Check for a duplicate before pushing onto the vector

or

Create a function to return a new vector of the unique names

Either of these implementations will retain the insert order and you'll be able to handle duplicates on your own terms.

Here is the second version:

#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> collection;

std::vector<std::string> getUniques(std::vector<std::string> collection)
{
    std::vector<std::string> uniques;
    for (std::string name : collection)
    {
        if (std::find(uniques.begin(), uniques.end(), name) == uniques.end())
            uniques.push_back(name);
    }

    return uniques;
}

int main()
{
    collection.push_back("John");
    collection.push_back("John");
    collection.push_back("Sally");
    collection.push_back("Kent");
    collection.push_back("Jim");
    collection.push_back("Sally");

    std::vector<std::string> uniques = getUniques(collection);

    for (std::string name : uniques)
        std::cout << name << std::endl;
}

Yields:

John
Sally
Kent
Jim

1) My original data does not have `Sally` repeat at the end. Each name appears in cluster. 2) There is [`std::unique`](http://en.cppreference.com/w/cpp/algorithm/unique) which you may want to check out. — Just a HK developer, Sep 02 '14 at 05:54
The solution i suggested would handle the data the same even if there are only sequential duplicates. And I'm also aware of unique but realize if you decide to use it that it will be more work to change your code later on if you decide you need the full data set including the duplicates as std:: unique will alter the original collection. — Nick Savage, Sep 02 '14 at 07:05

score 0 · Answer 4 · answered Sep 01 '14 at 09:54

First question: No. As per cplusplus.com:

Sets are containers that store unique elements following a specific order.

Second question: you would need to have 2 points of data to do that. The first would be your actual string, the second would be a sort of 'insertion index', so you can store the order of insertion.

So basically, you could do it if you put std::pair in your std::set and basically increment the number you put in the std::pair. However, once you do that, it means each std::pair will be unique, meaning the use of 'std::set' is gone.

The above already sounds way too complicated, so why not go with a more suitable container? You could for example use an std::vector and remove doubles upon insertion.

If that's too slow (O(N) insertion), you could have an std::vector for the in-order storage, and keep an std::set next to it to be able to quickly check uniqueness.

score 0 · Answer 5 · answered Sep 01 '14 at 10:33

0

From your example, it seems that the equal values follow each other.

If this is the case, there is no need for sophistication: you can start filling a new array and copy the elements one by one unless they are the same as the previous. This is a simple O(N) process.

answered Sep 01 '14 at 10:33

Yes, equal values follow each other. I thought about your suggestion before posting my question, but I don't like it to take O(N). – Just a HK developer Sep 02 '14 at 02:50
1

If you can do that faster then O(N), you're up to the Nobel prize of Computer Science. – Sep 02 '14 at 06:12

score 0 · Answer 6 · answered Sep 01 '14 at 11:27

0

Instead of std::set use std::unique

#include <iostream>
#include <algorithm>
#include <vector>
#include <cstring>

using namespace std;

bool myfunction (char *i,char *j) 
{
    int x=strcmp(i,j);
    if(!x)
        return 1;
    else
        return 0;
}

int main () 
{
  char mywords[][10] = {"Mary","Mary","John","John","John","Apple","Apple","Apple"};
  vector<char*> myvector (mywords,mywords+8);
  vector<char*>::iterator it;
  it = unique (myvector.begin(), myvector.end(), myfunction);
  myvector.resize(distance(myvector.begin(),it));

  cout << "Output:";
  for (it=myvector.begin(); it!=myvector.end(); ++it)
    cout << ' ' << *it;
  cout << endl;

  return 0;
}

answered Sep 01 '14 at 11:27

Milan Patel

422
4
12

It will work as long as they are grouped, it won't work if they are not grouped. – CashCow Sep 01 '14 at 16:09
@CashCow - Yes, that's true but user asked for a sequence which has the name appearing in the same order as input and has no duplicates. So for input: A A A B B A A C C Output:A B A C. If he wants to remove duplicates from entire list then my code wont work. – Milan Patel Sep 01 '14 at 20:24

How to stop std::set from sorting?

6 Answers6