I'm trying to make the switch from R to C++ coding. If you choose to downvote this question, at the very least patronize me with an answer so I can learn something. My question is: how am I supposed to approach row-wise calculations in C++ once I have passed C++ a dataframe? Conceptually, I understand that once I pass C++ a dataframe, C++ will treat each column as its own vector that I have to explicitly name. Where I am having trouble is setting up a for loop that iterates through the same position of all of the vectors at once, thus functionally emulating a row-wise function in R. I would like to extend this question to the following applications as well:

  1. How would I set up a loop that iterates across a row and returns a vector, like rowSums in R? There is an example of this in Advanced R using a matrix, but the nomenclature doesn't translate to a pile of vectors from a dataframe.
  2. How would I set up a loop that iterates across a row, changes the values in each row, and returns the modified vectors?
  3. How would I set up a loop that iterates through a range of rows at once, thus making a sliding window function? Like this:

    ## an example of a for loop in R that I want to recapitulate in C++
    output <- list()
    
    for (i in seq_len(nrow(df) - 3)) {  # stop early so the 4-row window stays inside df
      end_row <- i + 3
      df_tmp <- df[i:end_row, ]
      ## do some function here
      output[[i]] <- list(df_tmp)
    }
    
  4. How would I set up the same rolling function as in question 3, but in a way that allows me to conditionally extend the vector lengths? In R, I've written functions using apply that iterate over a range of rows and then return a list of new dataframes that I then turn into a large dataframe. Doing this one vector at a time is conceptually perplexing to me at the moment.

Let's say I have this dataframe in R:

    # example data
    a <- c(0, 2, 4, 6, 8, 10)
    b <- c(1, 3, 5, 7, 9, 11)
    c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
    d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
    e <- c("a", "t", "t", "g", "c", "a")

    df <- data.frame(a, b, c, d, e)

In c++, I have gotten this far:

    #include <algorithm>
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    DataFrame modifyDataFrame(DataFrame df) {

      // access the columns
      IntegerVector a = df["a"];
      IntegerVector b = df["b"];
      CharacterVector c = df["c"];
      NumericVector d = df["d"];  // d holds non-integer values, so NumericVector rather than IntegerVector
      CharacterVector e = df["e"];

      // write the for loop. I'm attempting to define a single
      // position and then apply it to all vectors...
      // but no versions of this approach have worked.

      for (int i = 0; i < a.length(); ++i) {
        // do some function
      }

      // return a new data frame
      return DataFrame::create(_["a"] = a, _["b"] = b, _["c"] = c, _["d"] = d, _["e"] = e);
    }

I've been following the Advanced R section on this. The part I'm struggling to grasp is the multiple-vector for loop construction, and how to define my range iterators. Based on my code, is that your interpretation too? Do I need to create an iterator for each vector, or can I simply define one position based on the length of one vector and then apply it to all vectors?

The easiest way for me to move past this is to see an example. Once I see an example of functional code, I'll be able to apply the concepts I've been reading about.

Edit: Would it be possible to add some examples like this to the RCPP documentation? I imagine many people struggle at this step. Considering the dataframe is one of the most common R data containers, I think the rcpp documentation would be greatly strengthened by a couple more dataframe examples. The conceptual switch is not trivial at first glance.

Ralf Stubner
Phil_T
  • We call it Rcpp. Not rcpp, also not RCPP. If you find the documentation wanting, decent pull requests are always welcome as are contributions to the [Rcpp Gallery](http://gallery.rcpp.org). – Dirk Eddelbuettel Jan 07 '19 at 19:03
  • See https://stackoverflow.com/questions/22828361/rcpp-function-to-select-and-to-return-a-sub-dataframe – G. Grothendieck Jan 07 '19 at 19:20
  • @G.Grothendieck I've seen that post. It deals with subsetting, which is simple enough. Are you suggesting I need to subset, and then write a function to perform a calculation? – Phil_T Jan 07 '19 at 19:35
  • @DirkEddelbuettel Sorry for the nomenclature error. I have searched the Rcpp Gallery, and I have failed to find what I'm looking for. Would you mind posting a link? I must not be searching this database correctly. – Phil_T Jan 07 '19 at 19:39

1 Answer

I am not convinced that you will gain performance from going to C++ here. However, if you have a set of vectors with equal length (`data.frame` guarantees that), then you can simply iterate with one index:

    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    DataFrame modifyDataFrame(DataFrame df) {

      // access the columns
      IntegerVector a = df["a"];
      IntegerVector b = df["b"];
      CharacterVector c = df["c"];
      NumericVector d = df["d"];
      CharacterVector e = df["e"];

      // one index i addresses the same row in every column
      for (int i = 0; i < df.nrow(); ++i) {
        a(i) += 1;
        b(i) += 2;
        c(i) += "c";
        d(i) += 3;
        e(i) += "e";
      }

      // return a new data frame
      return DataFrame::create(_["a"] = a, _["b"] = b, _["c"] = c, _["d"] = d, _["e"] = e);
    }
    /*** R
    a <- c(0, 2, 4, 6, 8, 10)
    b <- c(1, 3, 5, 7, 9, 11)
    c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
    d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
    e <- c("a", "t", "t", "g", "c", "a")

    df <- data.frame(a, b, c, d, e)
    modifyDataFrame(df)
    */

Result:

    > modifyDataFrame(df)
       a  b     c    d  e
    1  1  3 chr1c 13.2 ae
    2  3  5 chr1c 13.2 te
    3  5  7 chr1c  7.3 te
    4  7  9 chr1c  7.3 ge
    5  9 11 chr1c  6.4 ce
    6 11 13 chr1c 10.9 ae

Here I am using the `nrow()` method of the `DataFrame` class, cf. the Rcpp API. This uses R's C API, just like the `length()` method. I just find it more logical to use a `DataFrame` method than to single out one of the vectors to retrieve the length. The result would be the same.
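
To return one value per row (question 1), the same single index can fill a new output vector. A minimal sketch, assuming two numeric columns of equal length; `std::vector<double>` stands in for `NumericVector` so that the block compiles without R, and `rowSum2` is a made-up name:

```cpp
#include <cstddef>
#include <vector>

// Row-wise sum over two equal-length columns.
// Assumption: std::vector<double> is a stand-in for Rcpp::NumericVector,
// and rowSum2 is a hypothetical helper, not part of Rcpp.
std::vector<double> rowSum2(const std::vector<double>& a,
                            const std::vector<double>& b) {
  std::vector<double> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = a[i] + b[i];  // same index i = same row in every column
  }
  return out;
}
```

With Rcpp, the same loop body should carry over to `NumericVector` arguments, e.g. `out(i) = a(i) + b(i)`.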

As for a sliding window I would look into the RcppRoll package first.
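
If RcppRoll does not cover a custom window function, a double loop is one option: the outer loop runs over the output positions, the inner loop over the window. A rough sketch under stated assumptions: `std::vector<double>` stands in for `NumericVector` so that the block compiles without R, `rollMean` is a hypothetical name, and the mean is only a placeholder for the real per-window calculation.

```cpp
#include <cstddef>
#include <vector>

// Sliding-window mean with a double loop.
// Assumption: std::vector<double> is a stand-in for Rcpp::NumericVector;
// rollMean is a hypothetical helper, not an RcppRoll or Rcpp function.
std::vector<double> rollMean(const std::vector<double>& x, std::size_t width) {
  std::vector<double> out;
  if (width == 0 || x.size() < width) return out;  // no complete window
  out.reserve(x.size() - width + 1);
  // outer loop: one iteration per output value (window start i)
  for (std::size_t i = 0; i + width <= x.size(); ++i) {
    double acc = 0.0;
    // inner loop: rows i .. i + width - 1
    for (std::size_t j = 0; j < width; ++j) {
      acc += x[i + j];
    }
    out.push_back(acc / static_cast<double>(width));
  }
  return out;
}
```

The output has `x.size() - width + 1` elements, so only complete windows are evaluated.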

Ralf Stubner
  • I can use df.nrow()?! Huh, that's simple. Is this an Rcpp-only command? Does this call back to R in any way? I've read that calling back to R can drastically hinder the performance of C++. If so, is there a better approach? – Phil_T Jan 07 '19 at 19:43
  • Also, how would I specify a range of i+4 to create a "sliding" window like function? – Phil_T Jan 07 '19 at 19:46
  • @Phil_T Please see the amended answer. – Ralf Stubner Jan 07 '19 at 19:59
  • I need to write custom functions. Inside the "rolling window" framework, I'll be doing a lot of linear algebra, text parsing, and calculations of that nature. RcppRoll does not allow me to do this, which is why I'm trying to get around it. The stuff I've written using the zoo package is painfully slow. Would you mind giving me a nudge in the right direction so I can have something to start from? – Phil_T Jan 07 '19 at 20:15
  • I think I have a grasp on this. To create the sliding window, I'll need to dynamically create new sub vectors based on my range in question. Something like this: https://stackoverflow.com/questions/421573/best-way-to-extract-a-subvector-from-a-vector. Then I need to perform my calculations, and likely output each iteration as a list of vectors. Aggregate all my lists of vectors into a list of lists of vectors, and send this back to R. Once back in R, turn into a dataframe using as.data.frame()... or so I think. – Phil_T Jan 07 '19 at 20:58
  • @Phil_T You only need to extract sub vectors if you want to use Rcpp-sugar functions on them. Otherwise you can use a double loop, one with output length, the other with window length. Sometimes you can also use a single loop (e.g. for a rolling sum) by adding one value and removing another. – Ralf Stubner Jan 07 '19 at 23:19
  • @Phil_T Don't despair, and don't get mad, but a `data.frame` simply _is_ harder at the C(++) level as it is "just" a list of equal-length vectors. There are no magic row operations you are missing, and we don't have magic shortcuts. What @Ralf showed was about as good as it gets (outside of Rcpp Sugar on vectors and related tricks). – Dirk Eddelbuettel Jan 08 '19 at 02:05
  • @DirkEddelbuettel No anger on my end. More like C++ awe and impatience. I like this "no magic" type of programming. It's a way of thinking that I have been able to lazily avoid by programming in R. – Phil_T Jan 09 '19 at 02:40
  • All good then. Sometimes it is just different as these are different languages... – Dirk Eddelbuettel Jan 09 '19 at 03:01
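
The single-loop variant mentioned in the comments (add one value, remove another) might look like the sketch below. It is only an illustration: `std::vector<double>` stands in for `NumericVector` so that the block compiles without R, and `rollSum` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Rolling sum in a single pass: each step adds the value entering the
// window and subtracts the value leaving it.
// Assumption: std::vector<double> is a stand-in for Rcpp::NumericVector;
// rollSum is a hypothetical helper.
std::vector<double> rollSum(const std::vector<double>& x, std::size_t width) {
  std::vector<double> out;
  if (width == 0 || x.size() < width) return out;  // no complete window
  out.reserve(x.size() - width + 1);
  double acc = 0.0;
  for (std::size_t i = 0; i < width; ++i) {
    acc += x[i];  // sum of the first window
  }
  out.push_back(acc);
  for (std::size_t i = width; i < x.size(); ++i) {
    acc += x[i] - x[i - width];  // slide the window by one row
    out.push_back(acc);
  }
  return out;
}
```

This does a constant amount of work per row regardless of window width, but it only applies when the window statistic can be updated incrementally, as a sum can.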