Duplicating dplyr::group_by's indices functionality with Rcpp

Question

As an exercise, I'm trying to use Rcpp and C++ to get grouping indices, much like what dplyr::group_by provides. These are the row numbers (starting from 0) corresponding to each group in the data.

Here's an example of what the indices would look like.

x <- sample(1:3, 10, TRUE)
x
# [1] 3 3 3 1 3 1 3 2 3 2

df <- data.frame(x)
attr(dplyr::group_by(df, x), "indices")
#[[1]]
#[1] 3 5
#
#[[2]]
#[1] 7 9
#
#[[3]]
#[1] 0 1 2 4 6 8

So far, using the standard library's std::unordered_multimap, I've come up with the following:

// [[Rcpp::plugins(cpp11)]]

#include <Rcpp.h>
using namespace Rcpp;

typedef std::vector<int> rowvec;

// [[Rcpp::export]]
std::vector<rowvec> rowlist(std::vector<int> x)
{
    std::unordered_multimap<int, int> rowmap;
    for (size_t i = 0; i < x.size(); i++)
    {
        rowmap.insert({ x[i], i });
    }

    std::vector<rowvec> rowlst;
    for (size_t i = 0; i < rowmap.bucket_count(); i++)
    {
        if (rowmap.begin(i) != rowmap.end(i))
        {
            rowvec v(rowmap.count(i));
            int b = 0;
            for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
            {
               v[b] = it->second;
            }
            rowlst.push_back(v);
        }
    }
    return rowlst;
}

Running this on a single variable results in

rowlist(x)
#[[1]]
#[1] 5 3
#
#[[2]]
#[1] 9 7
#
#[[3]]
#[1] 8 6 4 2 1 0

Other than the reversed ordering, this looks good. However, I can't figure out how to extend this to handle:

Different data types; the type is currently hardcoded into the function
More than one grouping variable

(std::unordered_multimap is also pretty slow compared to what group_by does, but I'll deal with that later.) Any help would be appreciated.

For dealing with different data types, I think you will have to create an additional helper function that determines the type and then use a templated approach on your `rowlist`. — Joseph Wood, Apr 16 '18 at 18:30
The Rcpp Gallery has posts on dynamic dispatching. We have the problem here that the underlying `SEXP` is a union and can hold different types we will only know at run-time. — Dirk Eddelbuettel, Apr 18 '18 at 11:59
@DirkEddelbuettel yes, I'm struggling to duplicate the union in C++ at the moment. Will have a look. — Hong Ooi, Apr 18 '18 at 12:26
I (belatedly) saw your comment in the chat room. That seems ... weird as the `SEXP` already is one. Learn to love the `SEXP` :) Here is a [good post by Kevin on it](http://gallery.rcpp.org/articles/rcpp-wrap-and-recurse/) and here is [another one post doing for sparse matrices](http://gallery.rcpp.org/articles/dynamic-dispatch-for-sparse-matrices/). — Dirk Eddelbuettel, Apr 18 '18 at 12:29

score 2 · Answer 1 · answered Apr 19 '18 at 01:10

I've mulled over this question for quite some time, and my conclusion is that this is going to be quite difficult to say the least. In order to replicate the magic of dplyr::group_by, you are going to have to write several classes, and setup a really slick hashing function to deal with various data types and a differing number of columns. I've scoured the dplyr source code and it looks like if you follow the creation of ChunkMapIndex, you will get a better understanding.

Speaking of data types, I'm not even sure that using std::unordered_multimap can get you what you want as it is unwise and difficult to use double/float data type(s) as your key.

Given all of the challenges mentioned, the code below will produce the same output as attr(dplyr::group_by(df, x), "indices") with integers types. I've set it up to get you started thinking about how to deal with different data types. It uses a templated approach with a helper function as it is an easy an effective solution for dealing with different data types. The helper functions is very similar to the functions in the links Dirk provided.

// [[Rcpp::plugins(cpp11)]]

#include <Rcpp.h>
#include <string>
using namespace Rcpp;

typedef std::vector<int> rowvec;
typedef std::vector<rowvec> rowvec2d;

template <typename T>
rowvec2d rowlist(std::vector<T> x) {

    std::unordered_multimap<T, int> rowmap;
    for (int i = 0; i < x.size(); i++)
        rowmap.insert({ x[i], i });

    rowvec2d rowlst;

    for (int i = 0; i < rowmap.bucket_count(); i++) {
        if (rowmap.begin(i) != rowmap.end(i)) {
            rowvec v(rowmap.count(i));
            int b = 0;
            for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
                v[b] = it->second;

            rowlst.push_back(v);
        }
    }

    return rowlst;
}

template <typename T>
rowvec2d tempList(rowvec2d myList, std::vector<T> v) {

    rowvec2d vecOut;

    if (myList.size() > 0) {
        for (std::size_t i = 0; i < myList.size(); i++) {
            std::vector<T> vecPass(myList[i].size());
            for (std::size_t j = 0; j < myList[i].size(); j++)
                vecPass[j] = v[myList[i][j]];

            rowvec2d vecTemp = rowlist(vecPass);
            for (std::size_t j = 0; j < vecTemp.size(); j++) {
                rowvec myIndex(vecTemp[j].size());
                for (std::size_t k = 0; k < vecTemp[j].size(); k++)
                    myIndex[k] = myList[i][vecTemp[j][k]];

                vecOut.push_back(myIndex);
            }
        }
    } else {
        vecOut = rowlist(v);
    }

    return vecOut;
}

// [[Rcpp::export]]
rowvec2d rowlistMaster(DataFrame myDF) {

    DataFrame::iterator itDF;
    rowvec2d result;

    for (itDF = myDF.begin(); itDF != myDF.end(); itDF++) {
        switch(TYPEOF(*itDF)) {
            case INTSXP: {
                result = tempList(result, as<std::vector<int> >(*itDF));
                break;
            }
            default: {
                stop("v must be of type integer");
            }
        }
    }

    return result;
}

It works with more than one grouping variable, however it is not nearly as fast.

set.seed(101)
x <- sample(1:5, 10^4, TRUE)
y <- sample(1:5, 10^4, TRUE)
w <- sample(1:5, 10^4, TRUE)
z <- sample(1:5, 10^4, TRUE)
df <- data.frame(x,y,w,z)

identical(attr(dplyr::group_by(df, x, y, w, z), "indices"), rowlistMaster(df))
[1] TRUE

library(microbenchmark)
microbenchmark(dplyr = attr(dplyr::group_by(df, x, y, w, z), "indices"),
               challenge = rowlistMaster(df))
Unit: milliseconds
     expr       min        lq       mean     median         uq        max neval
    dplyr  2.693624  2.900009   3.324274   3.192952   3.535927   6.827423   100
challenge 53.905133 70.091335 123.131484 141.414806 149.923166 190.010468   100

Yeah I don't know what magic @Romain Francois worked with the grouping code, but it's potent. I'm not worried about keying on doubles: 1) it's also what `group_by` does; 2) if people do that with `group_by` then that's presumably what they want. — Hong Ooi, Apr 19 '18 at 02:01

Duplicating dplyr::group_by's indices functionality with Rcpp

1 Answers1