Looking at the Rcpp documentation and at Rcpp::DataFrame in the gallery, I realized that I didn't know how to modify a DataFrame by reference. Googling a bit, I found this post on SO and this post on the archive. Nothing obvious came up, so I suspect I am missing something big, like "it is already the case because" or "it does not make sense because".

I tried the following, which compiled, but the data.frame object passed to updateDFByRef in R stayed untouched:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void updateDFByRef(DataFrame& df) {
    int N = df.nrows();
    NumericVector newCol(N,1.);
    df["newCol"] = newCol;
    return;
}
statquant

2 Answers

The way DataFrame::operator[] is implemented does indeed lead to a copy when you do this:

df["newCol"] = newCol;

To do what you want, you need to consider what a data frame is: a list of vectors with certain attributes. You can then build a new list that reuses the columns of the original by copying the vectors (the pointers, not their contents).

Something like this does it. It is a little more work, but not that hard.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List updateDFByRef(DataFrame& df, std::string name) {
    int nr = df.nrows(), nc = df.size();
    NumericVector newCol(nr, 1.);
    List out(nc + 1);
    CharacterVector onames = df.attr("names");
    CharacterVector names(nc + 1);
    // Reuse the existing columns: only the pointers are copied, not the data
    for (int i = 0; i < nc; i++) {
        out[i] = df[i];
        names[i] = onames[i];
    }
    // Append the new column and its name
    out[nc] = newCol;
    names[nc] = name;
    // Carry over the attributes that make the list a data frame
    out.attr("class") = df.attr("class");
    out.attr("row.names") = df.attr("row.names");
    out.attr("names") = names;
    return out;
}

There are issues associated with this approach. Your original data frame and the one you created share the same vectors, so bad things can happen: modifying a column in place through one of them also changes the other. Only use this if you know what you are doing.
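To make the sharing concrete, here is a minimal sketch in plain C++ that does not use Rcpp at all. `Column`, `Frame`, `withNewColumn` and `overwriteFirstCell` are hypothetical names invented for this illustration; the only assumption is that a data frame behaves like a list of handles to column buffers, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins, not Rcpp types: a column is a shared buffer of
// doubles, and a frame is a list of named handles to such buffers.
using Column = std::shared_ptr<std::vector<double>>;

struct Frame {
    std::vector<std::string> names;
    std::vector<Column> columns;
};

// Build a new frame that reuses the existing columns (the handles are
// copied, the numeric data is not) and appends one freshly allocated column.
Frame withNewColumn(const Frame& df, const std::string& name, double fill) {
    assert(df.names.size() == df.columns.size());
    const std::size_t nrow =
        df.columns.empty() ? 0 : df.columns.front()->size();

    Frame out = df;  // shallow copy: names and column handles only
    out.names.push_back(name);
    out.columns.push_back(std::make_shared<std::vector<double>>(nrow, fill));
    return out;
}

// The hazard: a write through a shared column is visible in both frames.
void overwriteFirstCell(Frame& df, std::size_t col, double value) {
    assert(col < df.columns.size() && !df.columns[col]->empty());
    (*df.columns[col])[0] = value;
}
```

In this model, calling `overwriteFirstCell` on the frame returned by `withNewColumn` also changes the first cell of the original frame, because both frames hold a handle to the same buffer.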

Romain Francois
  • Thank you very much, that's clearer now. I think I was missing basic knowledge, like the fact that `SEXP`s were already references. I'll have a look at http://cran.r-project.org/doc/manuals/r-release/R-ints.pdf. For now I was "preparing" the `data.table` in R by adding an extra column and then updating it within Rcpp, so no copy (I think) was made. That's much better. I understand the risks of doing this, but it is fine for what I am doing. Merci beaucoup. – statquant Apr 01 '13 at 17:12
  • Once again, you are just plain wrong: _I think I was missing basic knowledge like the fact that SEXP were already references_. Not references, but pointers. Try to look up what the last letter in SEXP stands for. – Dirk Eddelbuettel Apr 01 '13 at 19:10

The short answer is "because it makes no sense".

A data.frame is essentially a list of vectors. A few seconds of reflection make it clear that adding a new column to that list entails a copy. So you alter your variable df in the example, do not return it, and hence lose the modification.

Merely wishing for something to work a certain way is not always enough.
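The point about losing the modification can be sketched in plain C++, without Rcpp. Everything here is a hypothetical model: `FrameData`, `Handle`, `appendColumn`, `addColumnAndDrop` and `addColumnAndReturn` are made-up names, and passing the handle by value stands in for the assumption that the exported function works on a wrapper object created for the call rather than on the caller's variable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical model, not Rcpp: the list of column names stands in for the
// whole data frame. It is immutable once built, like a list that has to be
// reallocated in order to grow.
using FrameData = std::vector<std::string>;
using Handle = std::shared_ptr<const FrameData>;

// Growing the list allocates a new one; the old list is left untouched.
Handle appendColumn(const Handle& df, const std::string& name) {
    assert(df != nullptr);
    auto grown = std::make_shared<FrameData>(*df);
    grown->push_back(name);
    return grown;
}

// Mirrors the void function in the question: `df` is a per-call wrapper,
// so rebinding it to the grown list never reaches the caller's variable.
void addColumnAndDrop(Handle df) {
    df = appendColumn(df, "newCol");
}

// Mirrors the fix: hand the new object back so the caller can keep it.
Handle addColumnAndReturn(Handle df) {
    return appendColumn(df, "newCol");
}
```

In this model the caller only sees the extra column by assigning the result of `addColumnAndReturn` to a variable; the object it passed in keeps its original columns in both cases.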

Dirk Eddelbuettel
  • So replacing `void` with `SEXP`, for example, and adding `return df;` will do the trick? – agstudy Mar 31 '13 at 15:53
  • Yes, any of SEXP, Rcpp::List or Rcpp::DataFrame will do. Currently it seems to come back as a list, so we seem to lose the data.frame-ness of it. – Dirk Eddelbuettel Mar 31 '13 at 15:54
  • Thanks. I just tested it and it works like a charm! Adding `as.data.frame` to the result gives me the desired data.frame-ness... – agstudy Mar 31 '13 at 15:58
  • That's pretty much what I do, and did with `Rcpp::List` before `Rcpp::DataFrame` was added. Good enough in almost all cases. – Dirk Eddelbuettel Mar 31 '13 at 16:02
  • @Dirk, you're wrong: `data.table` provides ways to add columns to a `data.frame` without any copy, and `data.table` is a daughter class of `data.frame`, so it is already possible. I was hoping for something simpler, so I asked. @agstudy, the problem with returning the `data.frame` as in the example is that building a `data.frame` back might be very costly; see this post http://gallery.rcpp.org/articles/faster-data-frame-creation/ for a workaround, or my EDIT with `data.table`. – statquant Mar 31 '13 at 16:33
  • Very nice. So there is your little project to get famous: port the data.table extension to Rcpp. But please stop making unfounded assumptions about the Rcpp code base: What you state is still wrong within Rcpp, which is what your question was about. If you'd rather use data.table, go for it. – Dirk Eddelbuettel Mar 31 '13 at 16:47
  • Sure! From now on I'll try to only ask questions when I have already figured out the answer, so I won't make unfounded assumptions anymore! – statquant Mar 31 '13 at 17:20
  • You once again misunderstood. Your assertion was in the comment here, and is still wrong. Your question was fine, if naive and lacking any finer understanding, but still did not call for cross-posting. – Dirk Eddelbuettel Mar 31 '13 at 17:43
  • @statquant Dirk wrote (with Romain) Rcpp. If he says you can't do what you want *in Rcpp*, you'd do well to take his advice as gospel. Note he doesn't say it *can't be done* full-stop - he was responding specifically to the Rcpp nature of your question. – Gavin Simpson Mar 31 '13 at 18:09