As an exercise, I'm trying to use Rcpp and C++ to get grouping indices, much like what dplyr::group_by
provides. These are the row numbers (starting from 0) corresponding to each group in the data.
Here's an example of what the indices would look like.
x <- sample(1:3, 10, TRUE)
x
# [1] 3 3 3 1 3 1 3 2 3 2
df <- data.frame(x)
attr(dplyr::group_by(df, x), "indices")
#[[1]]
#[1] 3 5
#
#[[2]]
#[1] 7 9
#
#[[3]]
#[1] 0 1 2 4 6 8
So far, using the standard library's std::unordered_multimap, I've come up with the following:
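// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_map>
#include <vector>
using namespace Rcpp;

typedef std::vector<int> rowvec;

// [[Rcpp::export]]
std::vector<rowvec> rowlist(std::vector<int> x)
{
    // Map each value of x to the (0-based) rows where it occurs.
    std::unordered_multimap<int, int> rowmap;
    for (size_t i = 0; i < x.size(); i++)
    {
        rowmap.insert({ x[i], static_cast<int>(i) });
    }

    // Walk the hash buckets and collect the rows stored in each non-empty one.
    // Note: this assumes every bucket holds a single distinct key.
    std::vector<rowvec> rowlst;
    for (size_t i = 0; i < rowmap.bucket_count(); i++)
    {
        if (rowmap.begin(i) != rowmap.end(i))
        {
            // Size by the bucket's element count; count(i) would look up i as a key.
            rowvec v(rowmap.bucket_size(i));
            int b = 0;
            for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
            {
                v[b] = it->second;
            }
            rowlst.push_back(v);
        }
    }
    return rowlst;
}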
Running this on a single variable results in
rowlist(x)
#[[1]]
#[1] 5 3
#
#[[2]]
#[1] 9 7
#
#[[3]]
#[1] 8 6 4 2 1 0
Other than the reversed ordering of the rows within each group, this looks good. However, I can't figure out how to extend this to handle:
- Different data types; the type is currently hardcoded into the function
- More than one grouping variable
(std::unordered_multimap is also pretty slow compared to what group_by does, but I'll deal with that later.) Any help would be appreciated.
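To make the two points above more concrete, here is a rough sketch of the kind of generalization I have in mind, written in plain C++ without Rcpp. The names group_rows and group_rows2 are hypothetical, and swapping the multimap for a std::map of vectors is just an assumption on my part, not how dplyr does it. The key type is a template parameter, and two grouping variables are handled by zipping the columns into std::pair keys.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Rcpp or dplyr): groups 0-based row
// numbers by key. std::map keeps the keys sorted, and pushing rows in
// increasing order keeps each group's indices ascending.
template <typename Key>
std::vector<std::vector<int>> group_rows(const std::vector<Key>& keys)
{
    std::map<Key, std::vector<int>> groups;
    for (std::size_t i = 0; i < keys.size(); ++i)
    {
        groups[keys[i]].push_back(static_cast<int>(i));
    }

    // Copy the groups out in key order.
    std::vector<std::vector<int>> out;
    out.reserve(groups.size());
    for (auto& kv : groups)
    {
        out.push_back(std::move(kv.second));
    }
    return out;
}

// Hypothetical helper for two grouping variables: zip the columns into
// std::pair keys, which already compare lexicographically.
template <typename A, typename B>
std::vector<std::vector<int>> group_rows2(const std::vector<A>& a,
                                          const std::vector<B>& b)
{
    std::vector<std::pair<A, B>> keys;
    keys.reserve(a.size());
    // Stop at the shorter column so mismatched lengths cannot read out of range.
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
    {
        keys.emplace_back(a[i], b[i]);
    }
    return group_rows(keys);
}
```

Calling group_rows on the x vector from the example should give the groups in the same order as group_by, with ascending rows inside each group. As far as I can tell, templates can't be exported with [[Rcpp::export]] directly, so this would still need a non-template wrapper that dispatches on the R vector type, which is the part I'm stuck on.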