I am trying to add a new column to a data frame using Rcpp.

In the following code, I intend to add a "result" column to the data frame df. But df does not have a "result" column after running the code. Could you tell me what is wrong with it?

R file that calls the AddNewCol() function:

library(Rcpp)
sourceCpp('AddNewCol.cpp')
AddNewCol( df ,"result")

AddNewCol.cpp

#include <Rcpp.h>
#include<math.h>
using namespace Rcpp;
// [[Rcpp::export]]
void AddNewCol(DataFrame& df, std::string new_var) {
  int maxRow = df.nrows();
  NumericVector vec_x = df["x"];
  NumericVector vec_y = df["y"];
  NumericVector resultvec = NumericVector(maxRow);

  for( int i = 0 ; i < maxRow; i++ ){
    resultvec[i] = vec_x[i] * pow( vec_y[i] , 2 );  
  }
  df[new_var] = resultvec;
}
Zheyuan Li
toshi-san

1 Answer

You cannot do it by reference. But if you return the data frame it works:

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame AddNewCol(const DataFrame& df, std::string new_var) {
  NumericVector vec_x = df["x"];
  NumericVector vec_y = df["y"];
  df[new_var] = vec_x * Rcpp::pow(vec_y, 2);
  return df;
}

/*** R
set.seed(42)
df <- data.frame(x = runif(10), y = runif(10))
AddNewCol( df ,"result")
*/

Note that I have taken the liberty of simplifying the computation a bit. Result:

> set.seed(42)

> df <- data.frame(x = runif(10), y = runif(10))

> AddNewCol( df ,"result")
           x         y      result
1  0.9148060 0.4577418 0.191677054
2  0.9370754 0.7191123 0.484582715
3  0.2861395 0.9346722 0.249974991
4  0.8304476 0.2554288 0.054181629
5  0.6417455 0.4622928 0.137150421
6  0.5190959 0.9400145 0.458687354
7  0.7365883 0.9782264 0.704861206
8  0.1346666 0.1174874 0.001858841
9  0.6569923 0.4749971 0.148232064
10 0.7050648 0.5603327 0.221371155
Ralf Stubner
  • Thank you for your answer! OK, I understand that I cannot modify the passed data frame. Do you think this limitation comes from Rcpp's level of abstraction? In other words, if I dig into something lower-level like R.h, could I modify the original data frame without copying it? – toshi-san Aug 16 '18 at 14:49
  • @toshi-san Have a look at what `data.table` is doing. – Ralf Stubner Aug 16 '18 at 14:52
  • OK, I will take a look at it! (I am starting to feel that what I said above may go against the R way. Vanilla R does not usually modify data in place, so trying to do so may not be the R way even at the Rcpp level. This is just a guess of mine.) – toshi-san Aug 16 '18 at 15:00
  • @toshi-san Regarding **why** this is so, you may be interested in the two answers to [this related question](https://stackoverflow.com/questions/15731106/passing-by-reference-a-data-frame-and-updating-it-with-rcpp). – duckmayr Aug 16 '18 at 20:40
  • @duckmayr Thanks for digging up that question! – Ralf Stubner Aug 16 '18 at 22:29
  • @duckmayr The related question you showed enhances my understanding. Thanks! – toshi-san Aug 17 '18 at 07:06
  • (+1) I'm a little confused: if `const DataFrame& df` promises `df` won't be changed, how can you `df[new_var] = vec_x * Rcpp::pow(vec_y, 2);`? – nalzok Aug 28 '19 at 23:04
  • @nalzok Have a look at these answers https://stackoverflow.com/questions/15731106/passing-by-reference-a-data-frame-and-updating-it-with-rcpp – Ralf Stubner Aug 29 '19 at 04:32
  • @RalfStubner Actually I've read them; they are the ones linked by @duckmayr. From my understanding, while `DataFrame&` conventionally means "call by reference", it silently makes a copy when you add a column to the data frame, as it is merely a list of vectors, which cannot be resized. However, they don't explain why the `const` qualifier is helpful in this case. – nalzok Aug 29 '19 at 04:41
  • @nalzok The `const` qualifier is inconsequential in this case. – Ralf Stubner Aug 29 '19 at 09:24
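
To make the point from the comments concrete (a data frame is a list that cannot be resized, so adding a column builds a new, longer list), here is a toy model in plain C++. It does not use Rcpp and is not a description of Rcpp's internals; `FixedList`, `Handle` and all function names are hypothetical stand-ins invented for this sketch. It only illustrates why the caller's object stays unchanged in the `void` version and why returning the result works.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an R list: its length is fixed once created.
struct FixedList {
    const std::vector<std::string> names;
    explicit FixedList(std::vector<std::string> n) : names(std::move(n)) {}
};

// Hypothetical stand-in for a wrapper class: a handle that points at a list.
class Handle {
    std::shared_ptr<const FixedList> data_;
public:
    explicit Handle(std::shared_ptr<const FixedList> d) : data_(std::move(d)) {
        assert(data_ != nullptr);
    }
    // The list cannot grow, so "adding a column" builds a longer copy
    // and rebinds only this handle to it.
    void add_column(const std::string& name) {
        std::vector<std::string> longer = data_->names;
        longer.push_back(name);
        data_ = std::make_shared<const FixedList>(std::move(longer));
    }
    std::shared_ptr<const FixedList> get() const { return data_; }
};

// Mirrors the void version: the new column exists only in the handle's copy.
void add_in_place(Handle& h) { h.add_column("result"); }

// Mirrors the returning version: the longer list is handed back to the caller.
std::shared_ptr<const FixedList> add_and_return(Handle h) {
    h.add_column("result");
    return h.get();
}

// Number of columns the caller sees after the void-style call.
std::size_t caller_columns_after_void_call() {
    auto df = std::make_shared<const FixedList>(std::vector<std::string>{"x", "y"});
    Handle wrapper(df);      // glue code wraps the caller's object in a handle
    add_in_place(wrapper);   // rebinds the handle, not the caller's df
    return df->names.size();
}

// Number of columns the caller sees after assigning the returned list.
std::size_t caller_columns_after_returning_call() {
    auto df = std::make_shared<const FixedList>(std::vector<std::string>{"x", "y"});
    df = add_and_return(Handle(df));
    return df->names.size();
}
```

In this model the first helper reports 2 columns and the second reports 3. On the R side the analogous step is assigning the return value, i.e. `df <- AddNewCol(df, "result")`.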