6

Here is a comparison I made. np.argsort was timed on a float32 ndarray consisting of 1,000,000 elements.

In [1]: import numpy as np

In [2]: a = np.random.randn(1000000)

In [3]: a = a.astype(np.float32)

In [4]: %timeit np.argsort(a)
86.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
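As a quick sanity check (not part of the timed run), the indices np.argsort returns can be verified to actually sort the array:

```python
import numpy as np

a = np.random.randn(1_000_000).astype(np.float32)
idx = np.argsort(a)                 # indices that would sort a
sorted_a = a[idx]                   # fancy indexing applies the permutation
assert np.all(sorted_a[:-1] <= sorted_a[1:])  # result is non-decreasing
```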

And here is a C++ program that does the same procedure, but on a vector, following this answer.

#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
  std::vector<float> numbers;
  for (int i = 0; i != 1000000; ++i) {
    numbers.push_back((float)rand() / (RAND_MAX));
  }

  double e1 = (double)cv::getTickCount();

  std::vector<size_t> idx(numbers.size());
  std::iota(idx.begin(), idx.end(), 0);

  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});

  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}

It prints Finished in 525.908 milliseconds. and is far slower than the numpy version. So could anyone explain what makes np.argsort so fast? Thanks.


Edit1: np.__version__ returns 1.15.0, running on Python 3.6.6 |Anaconda custom (64-bit), and g++ --version prints 8.2.0. The operating system is Manjaro Linux.


Edit2: I tried compiling with the -O2 and -O3 flags in g++ and got results of 216.515 milliseconds and 205.017 milliseconds. That is an improvement, but still slower than the numpy version. (Referring to this question.) This was deleted because I mistakenly ran the test with my laptop's DC adapter unplugged, which causes it to slow down. In a fair comparison, the C-array and vector versions perform equally (both take about 100 ms).


Edit3: Another approach is to replace the vector with a C-style array: float numbers[1000000];. After that, the running time is about 100 ms (+/- 5 ms). Full code here:

#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
  //std::vector<float> numbers;
  float numbers[1000000];
  for (int i = 0; i != 1000000; ++i) {
    numbers[i] = ((float)rand() / (RAND_MAX));
  }

  double e1 = (double)cv::getTickCount();

  std::vector<size_t> idx(1000000);
  std::iota(idx.begin(), idx.end(), 0);

  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});

  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}
Page David
  • 6
    aren't you supposed to be just timing the `sort` in the `c++` version? – EdChum Aug 23 '18 at 09:24
  • 3
    @EdChum But the vector idx should be regarded as a container for the sort result, so I think its initialization should be included in the timing – Page David Aug 23 '18 at 09:50
  • 1
    Platform, compiler, numpy version (and with what options it was built)? Did you compile with optimizations? Trying this locally, I get 122ms for Numpy, 115ms for c++.... (release build of course, debug is order of magnitude slower) – Dan Mašek Aug 23 '18 at 09:50
  • @DanMašek Could you please share your setup or configuration? I really wonder whether it is my unfamiliarity with the compiler that caused the problem. – Page David Aug 23 '18 at 11:11
  • wow. I did not expect that a classic array would be so much faster, or that STL `vector` would have such overhead. – rocksportrocker Aug 23 '18 at 11:29
  • @PageDavid Win10, MSVC2013 64bit. Switching to pointers makes no difference in release build here. My guess would be that the stdlib that g++ uses has some additional checks in vector operator[]... – Dan Mašek Aug 23 '18 at 11:55
  • Now I think it is the STL vector that caused this problem, so I changed the title to be more like a C++ question. And maybe `vector` is not quite the choice for performance or speed. – Page David Aug 23 '18 at 12:16
  • Could you please post the code of your C-array version? – xskxzr Aug 23 '18 at 15:06
  • 1
    Now I'm getting nearly identical run times (~107-110 ms) with python and C++ (the original version) using clang++ and libc++. g++ is about 10% slower. – n. m. could be an AI Aug 23 '18 at 17:10
  • FWIW: vector and C-array versions are equally fast (130ms) and only marginally slower than numpy (120ms) when compiled with `-O2` and gcc-5.4. vector and C-array being equally fast is what I would expect when the build is optimized. – ead Aug 24 '18 at 06:31
  • Numpy has a sort-version optimized for argsort (https://github.com/numpy/numpy/blob/464f79eb1d05bf938d16b49da1c39a4e02506fa3/numpy/core/src/npysort/quicksort.c.src#L543), so it can be expected that it performs slightly better than a general sort function which is used by your c++-version. – ead Aug 24 '18 at 06:48
  • 2
    I found I had made a really naive mistake in timing. My laptop, perhaps for battery saving, slows down when the DC adapter is unplugged. Sorry for the misleading info previously; now, in a fair environment, vector and C-array perform equally, as @ead said. – Page David Aug 24 '18 at 07:13
  • @ead I noticed that there are `np.aquicksort` and `np.quicksort` in the file you linked. Does the prefix `a` stand for the `argsort` version? If so, could you please explain briefly what optimization it makes? It would be even better if you provided an answer on that. – Page David Aug 24 '18 at 07:25
  • When you know you are sorting based on floats, you can use specialized algorithms for that. https://www.boost.org/doc/libs/1_68_0/libs/sort/doc/html/sort/single_thread/spreadsort/sort_hpp/float_sort.html for instance. `boost::sort::spreadsort::float_sort(idx.begin(),idx.end(),[&numbers](size_t x,unsigned off){return boost::sort::spreadsort::float_mem_cast(numbers[x])>>off;},[&numbers](size_t a,size_t b){return numbers[a] – Marc Glisse Aug 24 '18 at 10:59
  • At first glance I don't see any optimization done specially for argsort by numpy. Maybe numpy's quicksort version is just somewhat faster for this kind of input. – ead Aug 24 '18 at 11:37

2 Answers

3

I took your implementation and measured it with 10000000 items. It took approximately 1.7 seconds.

Now I introduced a class

class valuePair {
  public:
    valuePair(int idx, float value) : idx(idx), value(value){};
    int idx;
    float value;
};

which is initialized as

std::vector<valuePair> pairs;
for (int i = 0; i != 10000000; ++i) {
    pairs.push_back(valuePair(i, (float)rand() / (RAND_MAX)));
}

and the sorting is then done with

std::sort(pairs.begin(), pairs.end(), [&](const valuePair &a, const valuePair &b) { return a.value < b.value; });

This code reduces the runtime down to 1.1 seconds. I think this is due to better cache locality, but it is still quite far from the python results.

schorsch312
  • That is because python runs for 10^6 and not 10^7 elements:) – ead Aug 24 '18 at 11:33
  • But this is a very good observation. I guess argsort is not implemented this way because this approach needs additional memory (and you are missing the step where the indexes are copied to the result array). – ead Aug 24 '18 at 11:35
  • I removed one 0 and tested your code locally. Timing was started before `std::vector pairs;`. It finishes in about 85 to 100 ms. – Page David Aug 24 '18 at 11:35
  • For 10^7 elements, IPython reports `1.67 s ± 47.8 ms per loop`, and the code runs in about 910 - 925 ms. Maybe the number of elements should be taken into consideration? – Page David Aug 24 '18 at 11:40
  • And here I tried to modify the original version to 10^7 elements, which finished in about 1800 - 1900 ms. For the C-array version, I added one 0 to each of the three numbers, but it core dumped (probably a stack overflow from the larger on-stack array) and I could do nothing about it. So, nice work @schorsch312. – Page David Aug 24 '18 at 11:46
  • @PageDavid Thx! Please consider closing the question. – schorsch312 Aug 24 '18 at 11:55
1

Ideas:

  • different underlying algorithm: np.argsort uses quicksort by default; which algorithm std::sort uses may depend on your compiler's standard library.

  • function call overhead: I'm not sure whether C++ compilers inline your comparison function. If not, calling this function might also introduce some overhead. (Not the case, according to this post.)

  • compiler flags?
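Regarding the first point: NumPy lets you select the algorithm explicitly via the kind argument of np.argsort, so you can check whether the choice of algorithm matters for your input (a quick sketch):

```python
import numpy as np

a = np.random.rand(100_000).astype(np.float32)
# Compare the default quicksort against the other available algorithms.
for kind in ("quicksort", "mergesort", "heapsort"):
    idx = np.argsort(a, kind=kind)
    assert np.all(a[idx][:-1] <= a[idx][1:])  # each kind yields a sorted order
```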

rocksportrocker
  • 1
    According to https://stackoverflow.com/questions/13722426/why-can-lambdas-be-better-optimized-by-the-compiler-than-plain-functions, I think point two is not the cause of the problem. – Page David Aug 23 '18 at 10:17