Avoid SIGSEGV when subsetting data.frame with call to `[.data.frame` in Rcpp

Question

My Rcpp code occasionally fails (with a segfault, for example) for reasons I don't understand. The code creates a large data.frame and then tries to take a subset of it by calling R's subsetting function, `[.data.frame`, from within the same function that creates the frame. A very simplified version is shown below:

library(Rcpp)
src <- '// R function to subset data.frame - what will be called to subset
DataFrame test() {
Function subsetinR("[.data.frame"); 

// Make a dataframe in Rcpp to subset
size_t n = 100;
auto df =  DataFrame::create(Named("a") = std::vector<double> (n, 2.0),
                             Named("b") = std::vector<double> (n, 4.0));

// Now make a vector to subset with 
LogicalVector filter = LogicalVector::create(n, TRUE);
for (size_t i =0; i < n; i++) {
    if (i % 2 == 0) filter[i] = FALSE;
}   

// Subset, here is where it fails!
df = subsetinR(df, filter, R_MissingArg);
return df; 
}'  

fun <- cppFunction(plugins=c("cpp11"), src, verbose = TRUE, depends="Rcpp") 
fun()

However, while this occasionally works, at other times it fails with the following error:

*** caught segfault ***
   address 0x7ff700000030, cause 'memory not mapped'

Anyone know what is going wrong?

Note: This is not a duplicate. I have seen other Stack Overflow answers that build the result by subsetting each vector individually, e.g.

  // Next up, create a new DataFrame Object with selected rows subset. 
  return Rcpp::DataFrame::create(Rcpp::Named("val1")  = val1[idx],
                                 Rcpp::Named("val2")  = val2[idx],
                                 Rcpp::Named("val3")  = val3[idx],
                                 Rcpp::Named("val4")  = val4[idx]
                                 );

However, I am explicitly looking to avoid the repeated [idx] subsetting: the idx is not known when the data.frame is constructed (it is only known at the end), and I am hoping to find a way that doesn't involve writing the subset once per column. If it's possible to transform the data.frame at the end in one go, that would work just fine.

Secondly, do *not* use the subset feature in *R*. Instead, try to use the built in subset index for `Rcpp::*Vectors[]`. — coatless, Jul 19 '16 at 01:49
As @coatless calmly stated, C++ != R so your first call of entry from C++ should probably not be an R function. To a first approximation: if you use R code, you end up with R speed. — Dirk Eddelbuettel, Jul 19 '16 at 01:58
@Coatless Thanks for the comment. I saw the earlier question that you marked as a duplicate, but unfortunately that didn't solve my problem. I am actually trying to specifically avoid the [] subset operation on each vector, because there are 10's of columns in the vector which may or may not be loaded, and writing `df["something"] = something[idx]` seemed much less stable than just writing it once. — evolvedmicrobe, Jul 19 '16 at 02:39
@DirkEddelbuettel - this was just a minimal outline of the code, the subset operation is after several hundred lines of heavyweight C++ that I didn't show. It would obviously be very dumb to immediately call an R function from C++. The point is to avoid repeated subset operations, as the indexes to remove are not known when the data.frame is originally constructed. — evolvedmicrobe, Jul 19 '16 at 02:43
Also, thanks for the comments, will explain why it's not a duplicate and turn the sketch into a compiling bit of code — evolvedmicrobe, Jul 19 '16 at 02:44
Well, we have a saying "minimally reproducible example or it did not exist" for a reason. Hard to help in this case. — Dirk Eddelbuettel, Jul 19 '16 at 02:45
@DirkEddelbuettel - Thanks for help again, I've added a reproducible example above. This consistently segfaults on my machine (R 3.3, Rcpp_0.12.5) — evolvedmicrobe, Jul 19 '16 at 18:55
Look at the Rcpp Gallery examples creating a `data.frame` and _return a `data.frame` and not a list_. I also recommend switching to `cppFunction()` and `sourceCpp()` et al -- see the Rcpp Attributes FAQ. — Dirk Eddelbuettel, Jul 19 '16 at 19:07
@DirkEddelbuettel Changes made above to create/return a list, use cppFunction(). It still segfaults right away — evolvedmicrobe, Jul 19 '16 at 21:44
@DirkEddelbuettel I figured it out, would you mind reopening so I can post an answer in case anyone else has the same problem? — evolvedmicrobe, Jul 20 '16 at 01:15
@Coatless similarly could you remove the duplicate question tag? The issue here is that LogicalVector::create(100, TRUE) doesn't make anything close to a vector of size 100 filled with TRUE values, so I think it's quite distinct from the others. Plus, I'd like to point out a great Rcpp solution here : http://kevinushey.github.io/blog/2015/01/24/understanding-data-frame-subsetting/ — evolvedmicrobe, Jul 20 '16 at 01:20
We have to revote to reopen the question. There is one vote left. @DirkEddelbuettel — coatless, Jul 20 '16 at 03:43
Still a duplicate to me in the large sense. We have several posts here, and of course on the Rcpp Gallery, which discuss subsetting. This post seems to have redone the same work again. (By the way I think anybody can vote to reopen AFAIK it does not have to be the original voter.) — Dirk Eddelbuettel, Jul 20 '16 at 11:14
@DirkEddelbuettel - This question was about why this code produces a segfault, which is specific and is not a duplicate. The question was not how to subset a data frame, for which there are a gazillion answers (many on Stack Overflow). The simple solution to this question is that LogicalVector::create(100, TRUE) didn't make a vector of the appropriate size. In any event, my problem's solved, so I won't bother adding anything for others. — evolvedmicrobe, Jul 21 '16 at 00:20

score 2 · Accepted Answer · answered Jul 22 '16 at 06:21

The problem is that LogicalVector::create() is not doing what you expect -- it's returning a vector of length two, with the elements TRUE and TRUE. In other words, your code:

LogicalVector filter = LogicalVector::create(n, TRUE);

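generates not a logical vector of length n filled with TRUE, but a logical vector of length two: the first element is n, which is nonzero ('truthy') and so becomes TRUE, and the second is the explicit TRUE. The loop then writes filter[i] for every even i below n, far past the end of that two-element vector. Rcpp's operator[] does not throw on an out-of-range index, so those writes land in memory the vector does not own, which would explain why the crash is intermittent.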

You likely intended to just use the regular constructor, e.g. LogicalVector(n, TRUE).
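The same "list of elements" versus "(size, fill value)" distinction exists in the C++ standard library, which may make the pitfall easier to see. The sketch below is only an analogy in plain C++ and does not use Rcpp: std::vector<int> stands in for LogicalVector, and the helper names from_elements, from_size_and_fill and clear_even_positions are made up for this illustration. It also uses the bounds-checked at() accessor, so that the too-short vector triggers an exception instead of a silent out-of-range write.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Analogue of LogicalVector::create(n, TRUE): the arguments ARE the elements,
// so the result always has exactly two entries (n and 1), whatever n is.
std::vector<int> from_elements(int n) {
    return std::vector<int>{n, 1};
}

// Analogue of LogicalVector(n, TRUE): n copies of the fill value 1.
std::vector<int> from_size_and_fill(int n) {
    return std::vector<int>(static_cast<std::size_t>(n), 1);
}

// The loop from the question, but with bounds-checked access: at() throws
// std::out_of_range rather than writing past the end of a too-short vector.
void clear_even_positions(std::vector<int>& filter, int n) {
    for (int i = 0; i < n; i += 2) {
        filter.at(static_cast<std::size_t>(i)) = 0;
    }
}
```

With n = 100, from_elements yields two elements and clear_even_positions throws as soon as it reaches index 2, whereas from_size_and_fill yields 100 elements and the loop completes normally.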
