3

Is pushing integers into a vector, and then sorting the entire vector faster or slower than inserting the integers into a set, which sorts as you enter them. Sorry, I'm new to c++, and I'm not sure how to use the clock function. Could anyone help? Any help would be greatly appreciated. Thanks in advance.

#include <vector>
#include <iostream>
#include <set>
#include <algorithm>

using namespace std;

vector<int> possibility1;
set<int> possibility2;

int main()
{
    random_device rd;
    mt19937 rng(rd());
    uniform_int_distribution<int> uni(0,1000);

    // possibility 1
    for(int i = 0; i < 1000000; i++) {
        int r = uni(rng);
        possibility1.push_back(r);
    }

    // possibility 2
    for(int j = 0; j < 1000000; j++) {
        int r = uni(rng);
        possibility2.insert(r);
    }
}

EDIT

For this case, the difference in time doesn't really matter, but if I have large classes with a lot of private variables, and a make a vector/set of objects (with a compare function, too), which one would be faster?

NL628
  • 418
  • 6
  • 21
  • 2
    Depends on your purpose. Different data structures exist because it's not a one size fits all kinda thing – StoryTeller - Unslander Monica Nov 24 '17 at 23:56
  • 3
    What's faster depends on your particular requirements. Write your own benchmarks, using real data that you'll be actually using, and do your own timings. – Ken White Nov 24 '17 at 23:58
  • 4
    Generally, The vector tend to be faster for types that are cheap to copy/move (in particular, if you are using `std::vector::reserve()`), but measuring is the only way to be sure. – MikeMB Nov 24 '17 at 23:58
  • 3
    Doesn't matter all that much what the objects (or the size of one object) is. What matters is access patterns. Does your app process all and then just reads the items? Does it iterate over all of them often? Does it alternate removal and insertion? All of those things factor in. You didn't tell us about any of it. – StoryTeller - Unslander Monica Nov 24 '17 at 23:59
  • 3
    In my tests a sorted *vector* tends to be faster than a *set* (probably due to cache locality). You can create a function to insert into the correctly sorted position rather than sorting all the time. But its obviously more fiddly than just using a set. – Galik Nov 25 '17 at 00:00
  • 3
    Consider this: A `set` is usually implemented as a red-black-tree, using 3 pointers per node in addition to the data. On a 64-bit system this means that 1 million `int`s will likely use 32 MB as a set and 4 MB as a vector. How is that working out for your processor's cache size? – Bo Persson Nov 25 '17 at 02:41
  • 1
    I see, so set is more memory heavy. I am new to stack overflow. Could you tell me if this question is reasonable? If it's not, could you tell me how I would go about solving this question? Thanks. – NL628 Nov 25 '17 at 02:42
  • 2
    The rule of thumb is: "If you don't have a **specific** reason to use another container, prefer `std::vector`". So, if you have to ask... – Bo Persson Nov 25 '17 at 02:52
  • Just generally speaking, if you can do a boatload of insertions and just sort the data afterwards rarely without interleaving some insertions, then sort again, then insert some more, then sort again, `vector` will tend to win out big time for reasons mentioned above (reduced memory use, reduced allocations/deallocations, increased spatial locality). On top of winning initially, it'll give you more breathing room to optimize to your heart's content, since there's lots of room to sort things faster than `std::sort` with parallel sorts, radix sorts, etc with multithreading and reduced generality. –  Nov 29 '17 at 17:58

2 Answers2

4

Is putting integers into a vector, and then sorting faster, or putting integers into a set

Both have same asymptotic complexity O(n log n).

Set is a more complicated data structure, and it is most certainly going to have higher significant order coefficients and thus will be slower than a sorted vector by some factor in practice.

Whether the complexity difference has any significant impact to the performance of the program, depends on whether the size of the data structure is significant enough compared to other sources of complexity. This can be measured with a profiler.


However, if for example you were to need to keep your data structure sorted between insertions, rather than only after all insertions, then set would be asymptotically simpler O(n log n) than sorted vector O(n2).

Asymptotic complexity only guarantees that beyond some input threshold, set will be more efficient. But due to smaller coefficients, vector is probably still faster for inputs that are below the threshold. Many factors affect the threshold, such as hardware. To find out the threshold on a particular system, you can measure with a profiler.

eerorika
  • 232,697
  • 12
  • 197
  • 326
  • 1
    Vector would also have the advantage of being cache friendly. – Yashas Nov 25 '17 at 04:02
  • 1
    You don't have to sort the vector with every insertion. You can insert into the correctly sorted position using `std::upper_bound`. – Galik Nov 25 '17 at 04:28
  • 1
    @Galik doesn't really matter much asymptotically speaking. Regardless, you have to shift N (in worst case, N / 2 on average if we assume uniform distribution) elements each insert. That's pretty much same as inserting into back, and doing one iteration of bubble sort downwards. – eerorika Nov 25 '17 at 12:45
1

You can pretty simply determine average run times for a function call (like sort) using the C++ high resolution clock and getting the difference between two time_point's retrieved with now. This answer contains a good explanation of how to use that bit of the chrono library: C++ Cross-Platform High-Resolution Timer

You can use this kind of measurement to determine how long sorting takes in each case, but you should also bear in mind that set sorts as you go whereas a vector is sorted when sort is called on it. If your application can benefit from amortizing the cost across multiple insertions -- for example, you insert elements into the set and access elements in the set in interleaved operations, vs. inserting all elements at once, sorting, then only accessing the sorted collection later -- set might be the right choice. Several commenters have already mentioned that the right choice depends on your application, and it's why the advice is to decide on your own benchmarks and then measure them.

Matt
  • 161
  • 9