To address the primary issue, you are getting seemingly random segfaults because your code contains undefined behavior -- specifically, an array boundary violation. Since you noted earlier that you are fairly new to C++, it would be worthwhile for you to read through at least the first answer to this question which discusses this particular mistake. UB can be a slippery concept to wrap your head around for people coming to C or C++ from other languages, primarily because it does not always manifest itself in the form of an error. The behavior is literally undefined, so there is no way to know what the result will be, nor should you expect the behavior to be consistent across platforms, compilers, or even compiler versions.
I will use your leading_na
function to demonstrate, but the max_x_pos
function has the same problem:
// [[Rcpp::export]]
Rcpp::LogicalVector leading_na(Rcpp::IntegerVector x) {
int n = x.size();
Rcpp::LogicalVector leading_na(n);
int i = 0;
while (x[i] == NA_INTEGER) {
// ^^^^
Rcpp::Rcout << i << "\n";
leading_na[i] = TRUE;
i++;
}
return leading_na;
}
Since there isn't anything enforcing the constraint i < n
, when x
contains only NA
elements, the code proceeds to evaluate x[n]
(and possibly subsequent indices), which is undefined. However, this runs just fine on my machine for smaller vectors (I finally managed to make it crash with larger input), which is precisely why UB-related errors can be tough to identify:
leading_na(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# [1] TRUE TRUE TRUE TRUE TRUE
However, when we replace operator[]
with the at()
member function, which performs the same element access, but also does bounds checking at runtime, the error is apparent:
// [[Rcpp::export]]
Rcpp::LogicalVector leading_na2(Rcpp::IntegerVector x) {
int n = x.size();
Rcpp::LogicalVector leading_na(n);
int i = 0;
while (x.at(i) == NA_INTEGER) {
Rcpp::Rcout << i << "\n";
leading_na[i] = TRUE;
i++;
}
return leading_na;
}
and then
leading_na2(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# Error: index out of bounds
Note that the additional bounds checking provided by at
does come at a slight performance cost, since this check happens at runtime, so even though it can be a good idea to use at
instead of operator[]
during the development stages, once your code has been thoroughly tested it is probably a good idea to revert to operator[]
, assuming the better performance is desired.
As for solutions, the first would be to keep the while
loop and simply add a check on the value of i
:
while (i < n && x[i] == NA_INTEGER) {
leading_na[i] = TRUE;
i++;
}
Notice that I wrote i < n && x[i] == NA_INTEGER
and not x[i] == NA_INTEGER && i < n
. Since &&
performs short-circuit evaluation, when i < n
evaluates as false
in the first version, the expression x[i] == NA_INTEGER
is not evaluated -- which is good, because as we have seen, that is undefined behavior.
Another option would be to use a for
loop instead, which tend to do a better job of "reminding" us to check our boundaries, in my experience, at least:
for (int i = 0; i < n && x[i] == NA_INTEGER; i++) {
leading_na[i] = TRUE;
}
The choice to use a while
loop or a for
loop does not really matter in this case, provided that whatever you choose is written correctly.
Yet another option (or two) is to work with iterators rather than indices, in which case you could use either a while
loop or a for
loop:
// [[Rcpp::export]]
Rcpp::LogicalVector leading_na5(Rcpp::IntegerVector x) {
int n = x.size();
Rcpp::LogicalVector leading_na(n);
Rcpp::IntegerVector::const_iterator it_x = x.begin();
Rcpp::LogicalVector::iterator first = leading_na.begin(),
last = leading_na.end();
/*
while (first != last && *it_x++ == NA_INTEGER) {
*first++ = TRUE;
}
*/
for ( ; first != last && *it_x == NA_INTEGER; ++first, ++it_x) {
*first = TRUE;
}
return leading_na;
}
Although iterators are very useful devices, I'm not sure they provide any benefit over manual indexing in this particular case, so I would recommend using one of the first two approaches.
Unrelated to the segfault, there are a few other aspects of your code worth addressing.
- In R,
&&
and ||
perform atomic logical AND and atomic logical OR, respectively, while &
and |
perform vectorized logical AND and vectorized logical OR, respectively. In C++, &&
and ||
behave as they do in R, but &
and |
are (atomic) bitwise AND and (atomic) bitwise OR, respectively. Just by chance, using &
had the same effect as using &&
in your function above, but you will want to fix that since your intention was to use the logical operation, rather than the bitwise counterpart.
- This is more specific to Rcpp / R's C API, but although using
x[i] == NA_INTEGER
does, in fact, test if x[i]
is NA
, not all types behave like this. IIRC, testing anything for equality against NA_REAL
is always false, even NA_REAL == NA_REAL
; and for non-integer arithmetic types (numeric and complex (REALSXP
/ CPLXSXP
)), you most likely also want to be checking if the value is NaN
. Rcpp provides a few different ways to do this depending on the object type. For vectors of any storage type, Rcpp::is_na(x)
will return a logical vector the same size as x
. For atomic values, I typically use Rcpp::traits::is_na<SEXPTYPE>(x[i])
-- REALSXP
for double
, INTSXP
for int
, CPLXSXP
for Rcomplex
, and so on. However, I think you can equivalently use the vector's corresponding static member function, e.g. Rcpp::NumericVector::is_na(x[i])
, etc., in which case you don't need to memorize the various SEXPTYPE
s.
- Strictly speaking there is no
TRUE
or FALSE
in C++ or C; these are (presumably) convenience typedefs provided by R's API, so just be aware that they do not exist outside of R's backend. Of course, feel free to use them in your Rcpp code, as they clearly behave as intended, but most people stick to the standard true
and false
even when working with Rcpp.
- Kind of a nitpick, but your
leading_na
function declares a local variable also named leading_na
, which is slightly confusing, or at least unorthodox.
- Consider using
std::size_t
(standard C++) or R_xlen_t
(R API specific) when working with sizes of objects, such as in this expression: int n = x.size();
. Those are unsigned integer types which should be large enough to store the length of any object, where as int
is a signed integer type which may or may not be sufficient (it usually is). 99.9% of the time the worst that will happen is you will get some additional compiler warnings (not errors) when using int
s rather than the other two in expressions like for (int i = 0; i < x.size(); i++) { // whatever }
. On rare occasion there may be worse repercussions, such as signed integer overflow (which is also undefined behavior), so just be aware of this remote possibility.
This answer kind of turned into a code review / soapbox rant, but hopefully you find some useful information in there.