The difficulty is that a data frame is a set of vectors, potentially of different types; We need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of OrderVisitor
, which exhibit the following interface:
class OrderVisitor {
public:
virtual ~OrderVisitor(){}
/** are the elements at indices i and j equal */
virtual bool equal(int i, int j) const = 0 ;
/** is the i element less than the j element */
virtual bool before( int i, int j) const = 0 ;
virtual SEXP get() = 0 ;
} ;
dplyr then has implementations of OrderVisitor
for all types we are supporting in this file and we have a dispatcher function order_visitor
that makes an OrderVisitor*
from a vector.
With this, we can store a set of vector visitors into a std::vector<OrderVisitor*>
; The OrderVisitors has a constructor taking a DataFrame
and a CharacterVector
of names of vectors we want to use for the ordering.
OrderVisitors o(data, names ) ;
Then we can use the OrderVisitors.apply
method which essentially does lexicographic ordering:
IntegerVector index = o.apply() ;
The apply
method is implemented by simply initializing an IntegerVector
with 0..n
and then std::sort
it according to the visitors.
inline Rcpp::IntegerVector OrderVisitors::apply() const {
IntegerVector x = seq(0, nrows -1 ) ;
std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
return x ;
}
The relevant thing here is how the OrderVisitors_Compare
class implements operator()(int,int)
:
inline bool operator()(int i, int j) const {
if( i == j ) return false ;
for( int k=0; k<n; k++)
if( ! obj.visitors[k]->equal(i,j) )
return obj.visitors[k]->before(i, j ) ;
return i < j ;
}
So at this point index
gives us the integer indices of the sorted data, we just have to make a new DataFrame
from data
by subsetting data
with these indices. For this we have another kind of visitors, encapsulated in the DataFrameVisitors
class. We first create a DataFrameVisitors
:
DataFrameVisitors visitors( data ) ;
This encapsulates a std::vector<VectorVisitor*>
. Each of these VectorVisitor*
knows how to subset itself with an integer vector index. This is used from DataFrameVisitors.subset
:
template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
List out(nvisitors);
for( int k=0; k<nvisitors; k++){
out[k] = get(k)->subset(index) ;
}
structure( out, Rf_length(out[0]) , classes) ;
return (SEXP)out ;
}
To wrap this up, here is a simple function using tools developped in dplyr:
#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]
using namespace Rcpp ;
using namespace dplyr ;
// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
OrderVisitors o(data, names ) ;
IntegerVector index = o.apply() ;
DataFrameVisitors visitors( data ) ;
DataFrame res = visitors.subset(index, "data.frame" ) ;
return res ;
}