2

so i was asked this question on an interview. I had to make an algorithm which will sort 1 billion floats. I thought about doing it straightforward by making an array and quicksorting it but it turns out you can't make an array that will take up more than 2bil. bytes.
Yes I am kinda new to c++, but still.
PS in the task it was said that i can use c++14 so maybe there is something that can help?

WhoLeb
  • 29
  • 3
  • 2
    Yeah, but first you need to have enough space on the disk and indeed you need to merge sort it. It is the fastest way to do it and to keep the pace, you need to load that much floats, that they can fit in your current available space in the RAM. A very old code of mine: https://github.com/vkaytsanov/MergeSortOnFile – vikAy Jan 18 '21 at 08:31
  • 2
    _"it turns out you can't make an array that will take up more than 2bil. bytes"_ — Why not? It shouldn't be a problem an a 64-bit system. Anyway, if you want fast sorting, you definitely will need to use a parallel sorting algorithm (basically, a parallel quicksort or mergesort; the former is in-place, which might be necessary for such a large data set). – Daniel Langr Jan 18 '21 at 08:40
  • You absolutely can allocate `2^32` bytes, it's just 4GB. Radix sort might come handy here. – Quimby Jan 18 '21 at 08:41
  • there are normal laptops with more than 16GB of RAM so you can definitely sort 4GB array in memory. How to do it efficiently it's another problem. – bolov Jan 18 '21 at 09:11
  • 1
    I just tested with `std::vector data(1000000000);` and `std::sort(std::execution::par, data.begin(), data.end());`. It took 3.5 seconds on my machine (all zeroes though). ~15 seconds with random data. – Ted Lyngmo Jan 18 '21 at 09:16
  • @TedLyngmo `std::execution::par` isn't available in C++14. But there are some external libraries (such as libstdc++ parallel mode based on OpenMP). – Daniel Langr Jan 18 '21 at 09:36
  • 1
    @DanielLangr Oups, I missed the C++14 requirement. I wonder if external libraries would have been ok though? It's not that hard to make a simple threaded version oneself though. Just split it in X blocks and sort each block, then merge. X would be approx. the number of hardware threads available. If the file is bigger than you can fit in memory, memory map it. – Ted Lyngmo Jan 18 '21 at 09:58
  • @TedLyngmo If you have enough memory, then, yes, merge sort is relatively straightforward. But only until you want to merge in parallel (single-threaded merging might be a bottleneck otherwise). Parallel merge is not that simple. (Similarly, parallelization of a quicksort is relatively simple until you hit the parallel partitioning problem :). – Daniel Langr Jan 18 '21 at 10:04
  • @DanielLangr Indeed. I was only thinking about `std::sort`ing the X blocks in parallell and merging in one thread which would be pretty simple. To merge the sorted blocks one could `std::merge` 2 sorted blocks in each thread until all are merged, but I don't know how effective that'll be. It's pretty straight forward though. If memory mapped files aren't allowed, one could write an iterator class for that (that only uses standard `fstream`s) to use with the standard algorithms. It may be pretty slow, but it should get the job done. – Ted Lyngmo Jan 18 '21 at 10:20
  • 1
    Regarding the `i was asked this question on an interview` and `but it turns out you can't make an array that will take up more than 2bil. bytes.` raises the question if the limit is given by the one doing the interview. In Interviews one of you tasks is to also ask about the restrictions you have to deal with and further requirements. Without this information one can only tell you how this could be solved but that does not mean that such a solution is what the one that asked you that question had in mind. – t.niese Jan 18 '21 at 13:25
  • @TedLyngmo you also need to care about NaN when sorting floats by a special comparison function – phuclv Mar 20 '21 at 15:04
  • @phuclv The actual comparison function used when using the parallel execution policies I tested above should be the same as when just doing a regular sort (`operator<`) ... or am I missing something? Perhaps it was one of my other half-baked ideas you thought about? :-) – Ted Lyngmo Mar 20 '21 at 15:28
  • @TedLyngmo the normal `operator<` doesn't work for float if there's NaN in the array because NaN comparison doesn't satisfy strick weak ordering (which std::sort requires). [Does greater operator “>” satisfy strict weak ordering?](https://stackoverflow.com/a/58509684/995714), [Sorting a list containing NaNs](https://stackoverflow.com/q/28366298/995714), [sort function breaks in the presence of nan](https://stackoverflow.com/q/4240050/995714) – phuclv Mar 20 '21 at 15:43
  • @phuclv Ok, I thought you meant that there was a difference in what I suggested and a regular sort. I now see that you actually meant that one _should_ use a special comparison function. Sure, if there's a risk of NaN's etc, that sounds good. – Ted Lyngmo Mar 20 '21 at 15:52

0 Answers0