Given a data.frame with multiple columns, what is the fastest way to count each distinct combination of values across the columns using Rcpp, rather than pure R, for better performance?

For example, consider the following data.frame called df, with columns A, B, C, D, and E:

     A  B  C  D  E
  1  1  1  1  1  2 
  2  1  1  1  1  2
  3  2  2  2  2  3
  4  2  2  2  2  3 
  5  3  3  3  3  1

Expected output is as follows:

     A  B  C  D  E  count
  1  1  1  1  1  2      2
  2  2  2  2  2  3      2
  3  3  3  3  3  1      1

In R, this can be done by creating a new column that pastes the existing columns together and then using table to count each combination, that is:

df$combine <- do.call(paste, c(df, sep = "-"))
tab <- as.data.frame(table(df$combine))

Because the data wrangling and the table call are a bit slow in R, does anybody know a speedy way to do the same in Rcpp?

  • [This](http://stackoverflow.com/questions/18201074/find-how-many-times-duplicated-rows-repeat-in-r-data-frame) answer may help. – R. Schifini Jun 12 '16 at 04:18

1 Answer

Okay, here is one way I can think of to do it.

First of all, Rcpp::DataFrame is awkward to process row by row, because it is really a loose list of column vectors. So, I've simplified the problem by using an Rcpp::NumericMatrix that matches the sample data. From here, we can use a std::map to count unique rows. This is straightforward because Rcpp::NumericMatrix has a .row() member function that extracts a single row. Each row is converted to a std::vector<double>, which serves as the key for the map. For every row, we look up its std::vector<double> in the std::map and increment the associated count. Lastly, we export the std::map to the desired matrix format, with the count in an extra final column.
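
The counting step itself does not depend on Rcpp. As a minimal standalone sketch of the idea, assuming the rows are already available as a std::vector of std::vector<double> (the function name count_rows_sketch is hypothetical and not part of the answer), it might look like this:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, for illustration only: counts identical rows.
// std::vector provides a lexicographic operator<, which is all that
// std::map requires of its key type.
std::map<std::vector<double>, int>
count_rows_sketch(const std::vector<std::vector<double> >& rows) {
  std::map<std::vector<double>, int> counts;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    // operator[] inserts a zero-initialized count for an unseen row,
    // so the increment works for both new and existing keys.
    ++counts[rows[i]];
  }
  return counts;
}
```

Because std::map keeps its keys sorted, iterating over the result visits the distinct rows in lexicographic order. The Rcpp version of the same idea follows.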

#include <Rcpp.h>
#include <map>
#include <vector>

// [[Rcpp::export]]
Rcpp::NumericMatrix unique_rows( Rcpp::NumericMatrix & v)
{

  // Map from a row (stored as a std::vector) to the number of times it occurs.
  // A newly constructed map is already empty, so no clear() is needed.
  std::map<std::vector<double>, int> count_rows;

  // Count each element
  for (int i = 0; i != v.nrow(); ++i) {
    // Pop from R Matrix
    Rcpp::NumericVector a = v.row(i);
    // Convert R vector to STD vector
    std::vector<double> b = Rcpp::as< std::vector<double> >(a);

    // Add to map
    count_rows[ b ] += 1;
  }

  // Make output matrix
  Rcpp::NumericMatrix o(count_rows.size(), v.ncol()+1);

  // Hold count iteration
  unsigned int count = 0;

  // Start at the 1st element and move to the last element in the map.
  for( std::map<std::vector<double>,int>::iterator it = count_rows.begin();
       it != count_rows.end(); ++it )
  {

    // Grab the key of the map entry (the row values)
    std::vector<double> temp_o = it->first;

    // Tack the count onto the end of the row; this copy could probably be sped up.
    temp_o.push_back(it->second);

    // Convert from std::vector to Rcpp::NumericVector
    Rcpp::NumericVector mm = Rcpp::wrap(temp_o);

    // Store in a NumericMatrix
    o.row(count) = mm;

    count++;
  }

  return o;
}

Then we go with:

a = matrix(c(1, 1, 1, 1, 2,
             1, 1, 1, 1, 2,
             2, 2, 2, 2, 3,
             2, 2, 2, 2, 3,
             3, 3, 3, 3, 1), ncol = 5, byrow = TRUE)


unique_rows(a)

Giving:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    1    2    2
[2,]    2    2    2    2    3    2
[3,]    3    3    3    3    1    1
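
Since the question asks about speed, one possible variant is to swap std::map for a hash table. The sketch below is standalone C++11 without Rcpp and rests on assumptions: the names RowHash, RowCounts, and count_rows_hashed are hypothetical, and the mixing step is the commonly used "hash combine" pattern rather than anything from the answer. Whether it actually beats std::map depends on the data and should be benchmarked.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hash functor for a whole row (an assumption for this sketch).
struct RowHash {
  std::size_t operator()(const std::vector<double>& row) const {
    std::size_t seed = row.size();
    std::hash<double> hasher;
    for (std::size_t i = 0; i < row.size(); ++i) {
      // Mix each element's hash into the running seed.
      seed ^= hasher(row[i]) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// Keys are compared with std::vector's operator==, so two rows match
// only if they have the same length and equal elements.
typedef std::unordered_map<std::vector<double>, int, RowHash> RowCounts;

RowCounts count_rows_hashed(const std::vector<std::vector<double> >& rows) {
  RowCounts counts;
  counts.reserve(rows.size());  // upper bound on the number of distinct rows
  for (std::size_t i = 0; i < rows.size(); ++i) {
    ++counts[rows[i]];  // a missing key starts at 0 before the increment
  }
  return counts;
}
```

Unlike std::map, std::unordered_map does not keep its keys in any particular order, so the distinct rows would need sorting afterwards if a sorted result matters. Rows containing NaN (which includes R's NA_real_) need separate handling in either approach, because NaN never compares equal to itself.
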
coatless
  • I get the sense, rather vaguely given my level, that unique_rows <- function(x) { require (rcpp) and then your code above results in the Giving: but for those of us trying to make the transition at home... – Chris Jun 12 '16 at 09:47
  • @nrussell do you have a better way to go about this? – coatless Jun 12 '16 at 18:35
  • @Chris, I'm not sure if you are a troll account or if this is a serious question. If it is the latter, see: [`Rcpp::sourceCpp()`](http://www.inside-r.org/packages/cran/rcpp/docs/sourceCpp) – coatless Jun 12 '16 at 18:36
  • Thank you. I will read it with interest. What, by the way, is a troll? – Chris Jun 12 '16 at 18:44
  • @Coatless, many thanks for your reply. I just started using Rcpp and C++ a few days ago, and I don't think I could have come up with your solution. Cheers. – user2460415 Jun 14 '16 at 17:06