Review cpp code causing segfault

Question

I have some cpp code that's running within an R function that's called about ~80k times. It's test suite is comprehensive and passing. It seems to be running fine for the first 60k times it's called and then somewhere in the middle, I get a segfault.

*** Error in `/usr/lib/R/bin/exec/R': malloc(): memory corruption: 0x00000000047150f0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x77725)[0x7f684462e725]
/lib/x86_64-linux-gnu/libc.so.6(+0x819be)[0x7f68446389be]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f684463a5a4]
/usr/lib/R/lib/libR.so(Rf_allocVector3+0x70d)[0x7f6844cd617d]
... # more

Here's some of my code as an example, can you see anything wrong with it?

Return a LogicalVector (e.g TRUE/FALSE vector) where leading NAs are marked as TRUE

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
LogicalVector leading_na(IntegerVector x) {
  int n = x.size();
  LogicalVector leading_na(n);

  int i = 0;
  while(x[i] == NA_INTEGER) {
    leading_na[i] = TRUE;
    i++;
  }
  return leading_na;
}

Return a LogicalVector (e.g TRUE/FALSE vector) where trailing NAs are marked as TRUE

// [[Rcpp::export]]
LogicalVector trailing_na(IntegerVector x) {
  LogicalVector trailing_na = leading_na(rev(x));
  return rev(trailing_na);
}

Copies the functionality of na.locf(x, na.rm = TRUE) from the zoo package:

// [[Rcpp::export]]
IntegerVector na_locf(IntegerVector x) {
  int n = x.size();
  LogicalVector lna = leading_na(x);

  for(int i = 0; i<n; i++) {
    if((i > 0) & (x[i] == NA_INTEGER) & (lna[i] != TRUE)) {
        x[i] = x[i-1];
      }
  }
  return x;
}

Return the last position in a vector where there was a number:

// [[Rcpp::export]]
int max_x_pos(IntegerVector x) {
  IntegerVector y = rev(x);
  int n = x.size();
  LogicalVector leading_na(n);

  int i = 0;
  while(y[i] == NA_INTEGER) {
    i++;
  }

  return n-i;
}

Did you mean to use *bitwise* AND here: `(i > 0) & (x[i] == NA_INTEGER) & (lna[i] != TRUE)`? Because *logical* AND is `&&` in C++. — nrussell, Aug 28 '16 at 02:31
And how are you calling your `max_x_pos` function? Because this line is not well defined -- `while (y[i] == NA_INTEGER) { i++; }` -- when your input is all `NA`. To see this, change it to `while (y.at(i) == NA_INTEGER) { i++; }` to get bounds checking. When you call `max_x_pos(c(NA, NA))` you will get `Error: index out of bounds`. You should add a restriction on `i`, e.g. `while (i < n && y[i] == NA_INTEGER) { i++; }`. I think this may be causing your segfault. — nrussell, Aug 28 '16 at 02:56
Actually, your `leading_na` function has the same mistake described above. — nrussell, Aug 28 '16 at 03:06
I'm going to change the nature of the question so you can get credit for the changes you've recommended. Please convert your comments to an answer. — Brandon Bertelsen, Aug 28 '16 at 22:23
Please also change the Subject: as it most likely a common programming bug as opposed to a 'memory corruption'. — Dirk Eddelbuettel, Aug 28 '16 at 22:31
@DirkEddelbuettel Thanks, after changing it, I'm actually not sure what to entitle it. Do you have a recommendation? — Brandon Bertelsen, Aug 28 '16 at 22:36
How about a generic "help with intermittent errors" or something. Being on the other side of this package I am a wee bit sensitive to overly inflated severity levels such as "memory corruption" or "printer on fire" -- at least until *you* demonstrate that we set the printer on fire. — Dirk Eddelbuettel, Aug 28 '16 at 22:38
Ah sorry about that, it wasn't meant as Rcpp was the problem. I see how you could have read it that way for sure. Title updated. — Brandon Bertelsen, Aug 28 '16 at 22:41

score 6 · Accepted Answer · edited May 23 '17 at 12:22

To address the primary issue, you are getting seemingly random segfaults because your code contains undefined behavior -- specifically, an array boundary violation. Since you noted earlier that you are fairly new to C++, it would be worthwhile for you to read through at least the first answer to this question which discusses this particular mistake. UB can be a slippery concept to wrap your head around for people coming to C or C++ from other languages, primarily because it does not always manifest itself in the form of an error. The behavior is literally undefined, so there is no way to know what the result will be, nor should you expect the behavior to be consistent across platforms, compilers, or even compiler versions.

I will use your leading_na function to demonstrate, but the max_x_pos function has the same problem:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    int i = 0;
    while (x[i] == NA_INTEGER) {
        // ^^^^  
        Rcpp::Rcout << i << "\n";

        leading_na[i] = TRUE;
        i++;
    }

    return leading_na;
}

Since there isn't anything enforcing the constraint i < n, when x contains only NA elements, the code proceeds to evaluate x[n] (and possibly subsequent indices), which is undefined. However, this runs just fine on my machine for smaller vectors (I finally managed to make it crash with larger input), which is precisely why UB-related errors can be tough to identify:

leading_na(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# [1] TRUE TRUE TRUE TRUE TRUE

However, when we replace operator[] with the at() member function, which performs the same element access, but also does bounds checking at runtime, the error is apparent:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na2(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    int i = 0;
    while (x.at(i) == NA_INTEGER) {
        Rcpp::Rcout << i << "\n";

        leading_na[i] = TRUE;
        i++;
    }

    return leading_na;
}

and then

leading_na2(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# Error: index out of bounds

Note that the additional bounds checking provided by at does come at a slight performance cost, since this check happens at runtime, so even though it can be a good idea to use at instead of operator[] during the development stages, once your code has been thoroughly tested it is probably a good idea to revert to operator[], assuming the better performance is desired.

As for solutions, the first would be to keep the while loop and simply add a check on the value of i:

while (i < n && x[i] == NA_INTEGER) {
    leading_na[i] = TRUE;
    i++;
}

Notice that I wrote i < n && x[i] == NA_INTEGER and not x[i] == NA_INTEGER && i < n. Since && performs short-circuit evaluation, when i < n evaluates as false in the first version, the expression x[i] == NA_INTEGER is not evaluated -- which is good, because as we have seen, that is undefined behavior.

Another option would be to use a for loop instead, which tend to do a better job of "reminding" us to check our boundaries, in my experience, at least:

for (int i = 0; i < n && x[i] == NA_INTEGER; i++) {
    leading_na[i] = TRUE;
}

The choice to use a while loop or a for loop does not really matter in this case, provided that whatever you choose is written correctly.

Yet another option (or two) is to work with iterators rather than indices, in which case you could use either a while loop or a for loop:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na5(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    Rcpp::IntegerVector::const_iterator it_x = x.begin();
    Rcpp::LogicalVector::iterator first = leading_na.begin(),
        last = leading_na.end();

/*
    while (first != last && *it_x++ == NA_INTEGER) {
        *first++ = TRUE;
    }
*/

    for ( ; first != last && *it_x == NA_INTEGER; ++first, ++it_x) {
        *first = TRUE;
    }

    return leading_na;
}

Although iterators are very useful devices, I'm not sure they provide any benefit over manual indexing in this particular case, so I would recommend using one of the first two approaches.

Unrelated to the segfault, there are a few other aspects of your code worth addressing.

In R, && and || perform atomic logical AND and atomic logical OR, respectively, while & and | perform vectorized logical AND and vectorized logical OR, respectively. In C++, && and || behave as they do in R, but & and | are (atomic) bitwise AND and (atomic) bitwise OR, respectively. Just by chance, using & had the same effect as using && in your function above, but you will want to fix that since your intention was to use the logical operation, rather than the bitwise counterpart.
This is more specific to Rcpp / R's C API, but although using x[i] == NA_INTEGER does, in fact, test if x[i] is NA, not all types behave like this. IIRC, testing anything for equality against NA_REAL is always false, even NA_REAL == NA_REAL; and for non-integer arithmetic types (numeric and complex (REALSXP / CPLXSXP)), you most likely also want to be checking if the value is NaN. Rcpp provides a few different ways to do this depending on the object type. For vectors of any storage type, Rcpp::is_na(x) will return a logical vector the same size as x. For atomic values, I typically use Rcpp::traits::is_na<SEXPTYPE>(x[i]) -- REALSXP for double, INTSXP for int, CPLXSXP for Rcomplex, and so on. However, I think you can equivalently use the vector's corresponding static member function, e.g. Rcpp::NumericVector::is_na(x[i]), etc., in which case you don't need to memorize the various SEXPTYPEs.
Strictly speaking there is no TRUE or FALSE in C++ or C; these are (presumably) convenience typedefs provided by R's API, so just be aware that they do not exist outside of R's backend. Of course, feel free to use them in your Rcpp code, as they clearly behave as intended, but most people stick to the standard true and false even when working with Rcpp.
Kind of a nitpick, but your leading_na function declares a local variable also named leading_na, which is slightly confusing, or at least unorthodox.
Consider using std::size_t (standard C++) or R_xlen_t (R API specific) when working with sizes of objects, such as in this expression: int n = x.size();. Those are unsigned integer types which should be large enough to store the length of any object, where as int is a signed integer type which may or may not be sufficient (it usually is). 99.9% of the time the worst that will happen is you will get some additional compiler warnings (not errors) when using ints rather than the other two in expressions like for (int i = 0; i < x.size(); i++) { // whatever }. On rare occasion there may be worse repercussions, such as signed integer overflow (which is also undefined behavior), so just be aware of this remote possibility.

This answer kind of turned into a code review / soapbox rant, but hopefully you find some useful information in there.

Thank you -- and for the edit as well. It will probably take me a couple more passes to glean out the rest of the typos, but that's what I get for writing an essay I suppose... — nrussell, Aug 29 '16 at 00:38
I see that you recommened using a for loop instead of a while loop. My thought was that a for loop would run through the whole vector, where as the while loop would do only one if the condition was matched. For very long vectors, could this have a performance impact? — Brandon Bertelsen, Aug 29 '16 at 03:04
As long as you have both checks in place -- `i < n && x[i] == NA_INTEGER;` -- then the `for` loop will not run through the whole vector, i.e. `for (int i = 0; i < n && x[i] == NA_INTEGER; i++) { }` is equivalent to `while (i < n && x[i] == NA_INTEGER) { i++ }`. [Here is an example](https://gist.github.com/nathan-russell/d64881d07d14b86ef280f3942225c2eb). **If** the `for` loop were traversing the vector unnecessarily then yes, that would likely hurt performance, but that is not the case. For the record I think either type of loop is fine in this case. — nrussell, Aug 29 '16 at 03:35

Review cpp code causing segfault

1 Answers1