2

What is the algorithmic (time) complexity of std::next and std::nth_element for a std::multiset in C++? I am interested in the latest gcc implementation, in particular, if it matters. I have a set of numbers in std::multiset and I need to compute their median. I have a feeling that it should be possible to do it in O(log N) for a balanced binary tree (or even in O(1) if the tree is perfectly balanced and, so the median element is at the root of the tree).

using Set = std::multiset<double>;
const Set x = {/*some numbers*/};
if(x.empty()) return 0;
const Set::size_type size = x.size();
const Set::size_type size2 = size/2;
const Set::const_iterator it = std::next(x.begin(), size2);
return size%2==1 ? *it : 0.5*(*std::next(it,-1) + *it);

If the numbers were in a std::vector, I could do it in O(1). To find Nth element in a std::multiset I use std::next, but I am thinking if std::nth_elment could be better (I hope they both use the fact that std::multiset is sorted).

If what I am doing to compute the median is not optimal, please, advise on a better way to do it.

Thank you very much for your help!

S.V
  • 2,149
  • 2
  • 18
  • 41
  • 2
    [`std::nth_element`](https://en.cppreference.com/w/cpp/algorithm/nth_element) requires RandomAccessIterator, so it's cannot be used with `std::multiset`. – Yksisarvinen Mar 16 '22 at 22:21
  • You said “if it were in a vector I could…”. Why not use a vector? – Taekahn Mar 16 '22 at 22:25
  • I do not use a `vector` since I need to compute the median as quickly as possible when it is needed. So, while data is collected, I can much faster maintain it sorted (ready for median computation) state in a `multiset` (insert has `O(log N)` complexity) than in a `vector` (insert has `O(N)` complexity). – S.V Mar 16 '22 at 22:32
  • Every optimization has trade-offs. Have you considered other ways of calculating the median? I.e. keeping an additional variable around that is the current running total? – Taekahn Mar 16 '22 at 22:44
  • [`boost::flat_multiset`](https://www.boost.org/doc/libs/1_78_0/doc/html/boost/container/flat_multiset.html) would give you O(1) median, at the cost of O(log(c) + m) inserts, where c is element comparisons and m is element moves – Caleth Mar 16 '22 at 22:45
  • @Caleth unless I’m mistaken, that isn’t any different than maintaining a sorted vector? If we assume random inserts, `m == n` and dominates the `log(c)` factor? No? – Taekahn Mar 16 '22 at 22:56
  • 1
    @Taekahn it *is* a sorted vector. `m` and `c` measure different operations. if you weigh them equally then it's linear, yes – Caleth Mar 16 '22 at 22:59
  • @S.V keeping a vector sorted while progressively adding data is also very efficient due to the cache locality nature of vectors. That's why [Bjarne Stroustrup says we must avoid linked lists](https://stackoverflow.com/q/34170566/995714) even when randomly inserting data because benchmarks has proved that [vector is still faster in that case](https://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) most of the time – phuclv Mar 17 '22 at 03:10
  • @phuclv Assuming **large** number of elements, do you think frequent inserting into a sorted `vector` has a real chance to be faster than into a `multiset`? This goes by a large factor against the theoretical complexities of the operations `O(N)` vs `O(log N)`. – S.V Mar 17 '22 at 16:41

1 Answers1

1

std::next is O(k) in the distance you move.

There is no other way to get the median, practically, if you have a multiset.

It is possible to write a more efficient advance operation on containers like this, but std does not contain it.

You could keep around two multisets and balance them using splicing operations such that the median is always the first element of the 2nd container. This will have asymptotically good performance characteristics.

It could make other operations a bit more of a pain.

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524