8

I have a HUGE table (about 50Gb) in (i,j,k) format (from a sparse matrix) stored as

uint32_t * idx1, * idx2;
float * vals;
uint32_t tablesize;

and I'd like to sort it in place with a given comparison function that's a function of idx1 and idx2. Can this be done using std::sort?

Specifically, each nonzero entry (i,j) with value v in the sparse matrix is stored by placing i in idx1, j in idx2, and v in the corresponding entry in vals. I'd like to then sort these entries according to (i1, j1, v1) <= (i2, j2, v2) if

(i1 < i2) || (i1==i2 && j1 <= j2)

The examples I've been able to scrounge up of using std::sort on nonstandard datatypes assume that each item being compared is a single instance of a class; here each item is represented by three values in different arrays.

TemplateRex
  • 69,038
  • 19
  • 164
  • 304
AatG
  • 685
  • 8
  • 23
  • In a nutshell, look at the 3 argument version of `std::sort`, and then look up `functor` or `function object`. – PaulMcKenzie Nov 11 '14 at 23:06
  • 1
    So you need to help up -- if I give you two (i,j,k) values values tell us how you would determine if the first value comes before the second value. Also what form is this "table"? You need to tell us a little bit more detail as to how this data is structured. – PaulMcKenzie Nov 11 '14 at 23:11
  • 1
    So you want all three arrays sorted? The easiest would be to combine them all into a `struct` and just have a single array of that type. – Jonathan Potter Nov 11 '14 at 23:12
  • 1
    I don't see how to use the three argument version of sort. What would the RandomAccessIterator be here? That's what I meant about nonstandard datatypes: examples are available for how to deal with this kind of problem if your values are stored as a single class object as Jonathan recommended (this is infeasible for me, since constructing such a struct as a second copy of my data I'd run out of memory), but nothing about if your data is distributed across several arrays. – AatG Nov 11 '14 at 23:50

2 Answers2

3

It is unfortunately quite difficult to convince std::sort, or any of the standard library, to work with striped data. It is designed to assume that data can be copied via a single =, moved via one move or swapped via one swap.

Your best bet is to use boost::iterator_facade to write a custom iterator class which wraps the data, and hides the striped data format from std::sort. I've wanted to do something similar in the past but my workspace does not allow us to use boost. EDIT: when your facade is dereferenced, it will probably need to create some sort of proxy object that can be assigned/moved/swapped and will do the right thing to each of the stripe arrays. It's not trivial.

The next best bet is to make an array of ints from zero to N, each one representing an index into your striped data array. Write a custom functor to std::sort which sorts this array to match your criteria. It's obviously far from ideal when you have such a large data set.

StilesCrisis
  • 15,972
  • 4
  • 39
  • 62
  • 3
    I think this answer is closest to what you want, but you may want to consider rolling your own sort optimized for truly huge arrays; 50 GB of data, even if its in RAM is better treated with external sorting algorithms to exploit locality of access better; it's also a fair bet that you'll benefit from a parallel sort as well. – Stefan Atev Nov 12 '14 at 03:22
  • Good point. I was answering "can you use `std::sort` here" and not "*should* you". All the gymnastics I've listed above are likely going to be more effort than just copy-pasting a basic `qsort` implementation and tweaking it to suit your needs, and the hand-tweaked implementation is likely to be faster as well. – StilesCrisis Nov 12 '14 at 06:22
  • An iterator that returns a proxy when dereferenced is not a RandomAccessIterator, since ForwardIterators (and hence anything more refined) are required to return `value_type&` or `const value_type&` when dereferenced. (per [forward.iterators]/1.3) – Casey Nov 12 '14 at 15:37
1

If you have to keep using your existing data structure, which is essentially a std::tuple of three std::vectors, using boost::zip_iterator would seem to be the way to go. A zip_iterator treats three iterators (two to indices and one to a value) as a single tuple, and you can use a custom comparison function object to sort your data in-place. Alas, boost::zip_iterator can't be used with std::sort, as explained in this Q&A, because it can't be written into.

This means that you would have to write your own zip_iterator class that can be used with std::sort. Note that it is not a trivial exercise, see this Q&A and/or this paper.

It is a lot easier to sort a std::vector of a std::tuple. My attempt below uses a std::tuple of two indices and a value, and stores those entries into a std::vector. For the sorting, I use a C++14 generic lambda that forwards the two indices into a smaller tuple and compares those lexicographically (i.e. first on row-index, then on column-index) using the library operator< of std::tuple.

#include <algorithm>
#include <iostream>
#include <tuple>
#include <vector>

using index = uint32_t;
using value = float;
using sparse_entry = std::tuple<index, index, value>;
using sparse_matrix = std::vector<sparse_entry>;

int main()
{
    // sparse 3x3 matrix
    auto m = sparse_matrix { 
        std::make_tuple( 1, 1, -2.2), 
        std::make_tuple( 1, 0, 42  ), 
        std::make_tuple( 0, 2,  3.4), 
        std::make_tuple( 0, 1,  1.7) 
    };    

    // sort by row-index, then column-index
    std::sort(begin(m), end(m), [](auto const& L, auto const& R) {
        return 
            std::forward_as_tuple(std::get<0>(L), std::get<1>(L)) <
            std::forward_as_tuple(std::get<0>(R), std::get<1>(R))
        ;
    });

    for (auto const& elem : m)
        std::cout << "{ " << std::get<0>(elem) << ", " << std::get<1>(elem) << ", " << std::get<2>(elem) << "}, \n";
}

Live Example.

If your application can use this transformed data layout (and there maybe cache-performance reasons why it can't), then the above code will do the sorting as you want it.

NOTE: as @Casey mentions, you can also use std::tie instead of std::forward_as_tuple, but that can bite you when you change sparse_entry into a full-fledged user-defined class with getters returning by-value.

Community
  • 1
  • 1
TemplateRex
  • 69,038
  • 19
  • 164
  • 304
  • `std::tie` is a bit quicker to type than `std::forward_as_tuple`, and has the same effect in this instance. – Casey Nov 12 '14 at 15:19
  • @Casey `std::tie` only takes lvalues, whereas `std::forward_as_tuple` also works for getters returning by-value. Also, it used to be the case that `std::tie` was not `constexpr`, so I have fallen into the habit of not using `std::tie`. – TemplateRex Nov 12 '14 at 15:21
  • There are no overloads of `std::get` that return by value, and it's nearly impossible to extend since (a) it's forbidden to overload functions in `std`, and (b) it's not possible to partially specialize functions. It would also be fairly perverse for such an extension to return by value when passed an lvalue. In any case, I did edit my comment to add "in this instance" ;) – Casey Nov 12 '14 at 15:26
  • @Casey updated my answer, tnx! btw, with getters you'd write something like `std::tie(L.row(), L.column())` instead of `std::get` of course – TemplateRex Nov 12 '14 at 15:28
  • Ah, I see. I mistook "getters" in your comment to mean flavors of `std::get`. That makes more sense now. – Casey Nov 12 '14 at 15:33
  • I thought this answer was brilliant, but then I looked into it and found that `zip_iterator` cannot be written into. Such a shame! I wanted it to be true. http://stackoverflow.com/questions/9343846/boost-zip-iterator-and-stdsort – StilesCrisis Nov 12 '14 at 19:25
  • @StilesCrisis Thanks for pointing out that `boost::zip_iterator` can't be written into. Fortunately, the solution in the cited paper by Anthony Williams can. Updated the answer, thanks again. – TemplateRex Nov 12 '14 at 20:49
  • I ended up implementing this; it works way faster than the custom mergesort I had written... but is std::sort guarantee to work in-place? – AatG Nov 12 '14 at 23:07
  • @AlexGittens yes, `std::sort` is in-place, unlike `std::stable_sort` – TemplateRex Nov 13 '14 at 09:17