Question
Why does my Rcpp function inside a data.table join produce a different (and incorrect) result compared to when it is used outside of a join?
Example
I have two data.tables, and I want to find the Euclidean distance between each pair of coordinates across both tables.
To do the distance calculation I've defined two functions, one in base R and the other using Rcpp.
library(Rcpp)
library(data.table)
rEucDist <- function(x1, y1, x2, y2) return(sqrt((x2 - x1)^2 + (y2 - y1)^2))
cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
                                         NumericVector x2, NumericVector y2){
  int n = x1.size();
  NumericVector distance(n);
  for(int i = 0; i < n; i++){
    distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
  }
  return distance;
}')
dt1 <- data.table(id = rep(1, 6),
                  seq1 = 1:6,
                  x = c(1:6),
                  y = c(1:6))
dt2 <- data.table(id = rep(1, 6),
                  seq2 = 7:12,
                  x = c(6:1),
                  y = c(6:1))
When I do the join first and then calculate the distance, both functions produce the same result:
dt_cpp <- dt1[dt2, on = "id", allow.cartesian = TRUE]
dt_cpp[, dist := cppEucDistance(x, y, i.x, i.y)]
dt_r <- dt1[dt2, on = "id", allow.cartesian = TRUE]
dt_r[, dist := rEucDist(x, y, i.x, i.y)]
all.equal(dt_cpp$dist, dt_r$dist)
# [1] TRUE
However, if I do the calculation within a join, the results differ; the cpp version is incorrect:
dt_cppJoin <- dt1[
  dt2,
  { (cppEucDistance(x, y, i.x, i.y)) },
  on = "id",
  by = .EACHI
]
dt_rJoin <- dt1[
  dt2,
  { (rEucDist(x, y, i.x, i.y)) },
  on = "id",
  by = .EACHI
]
all.equal(dt_cppJoin$V1, dt_rJoin$V1)
# "Mean relative difference: 0.6173913"
## note that the R version of the join is correct
all.equal(dt_r$dist, dt_rJoin$V1)
# [1] TRUE
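As a possible diagnostic (a minimal sketch, not part of my original code), the lengths of the vectors that reach `j` inside the `by = .EACHI` join can be inspected directly. The output column names `len_x` and `len_ix` are made up for illustration.

```r
library(data.table)

# Same tables as defined above, repeated so this snippet runs on its own
dt1 <- data.table(id = rep(1, 6), seq1 = 1:6, x = 1:6, y = 1:6)
dt2 <- data.table(id = rep(1, 6), seq2 = 7:12, x = 6:1, y = 6:1)

# For each row of dt2 (by = .EACHI), record how long the x-table column `x`
# and the i-table column `i.x` are when j is evaluated.
# len_x and len_ix are illustrative names only.
lens <- dt1[
  dt2,
  .(len_x = length(x), len_ix = length(i.x)),
  on = "id",
  by = .EACHI
]
print(lens)
```

If `len_x` and `len_ix` differ, the two functions would not be receiving equal-length inputs: the base R version recycles the shorter argument, whereas the loop in cppEucDistance reads `x2[i]` and `y2[i]` for every `i` up to `x1.size()` with no length check.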
What is it about the Rcpp implementation that causes the join version to give a different result?