6

Question

Why does my Rcpp function inside a data.table join produce a different (& incorrect) result compared to when used outside of a join?

Example

I have two data.tables, and I want to find the Euclidean distance between each pair of coordinates across both tables.

To do the distance calculation I've defined two functions, one in base R, and the other using Rcpp.

library(Rcpp)
library(data.table)

rEucDist <- function(x1, y1, x2, y2) return(sqrt((x2 - x1)^2 + (y2 - y1)^2))

cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
                                          NumericVector x2, NumericVector y2){

            int n = x1.size();
            NumericVector distance(n);

            for(int i = 0; i < n; i++){
              distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
            }
            return distance;
}')


dt1 <- data.table(id = rep(1, 6),
                  seq1 = 1:6,
                  x = c(1:6),
                  y = c(1:6))

dt2 <- data.table(id = rep(1, 6),
                  seq2 = 7:12,
                  x = c(6:1),
                  y = c(6:1))

When doing a join first, then calculating the distance, both functions produce the same result

dt_cpp <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_cpp[, dist := cppEucDistance(x, y, i.x, i.y)]

dt_r <- dt1[ dt2, on = "id", allow.cartesian = T]
dt_r[, dist := rEucDist(x, y, i.x, i.y)]

all.equal(dt_cpp$dist, dt_r$dist)
# [1] TRUE

However, if I do the calculation within a join the results differ; the cpp version is incorrect.

dt_cppJoin <- dt1[
    dt2,
    { (cppEucDistance(x, y, i.x, i.y)) },
    on = "id",
    by = .EACHI
    ]

dt_rJoin <- dt1[
    dt2,
    { (rEucDist(x, y, i.x, i.y)) },
    on = "id",
    by = .EACHI
    ]

all.equal(dt_cppJoin$V1, dt_rJoin$V1)
# "Mean relative difference: 0.6173913"

## note that the R version of the join is correct
all.equal(dt_r$dist, dt_rJoin$V1)
# [1] TRUE

What is it about the Rcpp implementation that causes the join version to give a different result?

SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • This might be relevant: `dt1[ dt2, list(len1 = length(x), len2 = length(i.x)), on = "id", by=.EACHI ]` -- the allow.cartesian probably matters (though I haven't looked closely at your functions). – Frank Jun 09 '17 at 00:29
  • @Frank - so it seems : `dt1[ dt2, list(len1 = length(x), len2 = length(i.x)), on = "id", allow.cartesian = T ]` - The trouble is, I'm trying to avoid the cartesian join as I don't have enough RAM. – SymbolixAU Jun 09 '17 at 00:30
  • 1
    I'm not sure how/where the symbols inside those function calls are evaluated, but have you tried printing from your R and Rcpp function their inputs? E.g. printing "x2" from both functions shows different vectors having been passed (the lengths seem to differ). – alexis_laz Jun 09 '17 at 10:51
  • @alexis_laz - good observation - I'll explore this line of enquiry further – SymbolixAU Jun 09 '17 at 21:23
  • yes; it's certainly to do with the different vector sizes and the loop. The `EACHI` is reducing the `dt2` to 1-length vectors, so my loop in the `Rcpp` code is incorrect. – SymbolixAU Jun 09 '17 at 22:08

1 Answers1

3

tl;dr

  1. Ultimately, this comes down to my lack of understanding of how EACHI works.

  2. My C++ loop as it stands is relying on x1 and x2 being the same length. This is not the case when EACHI is used, so my vector-subsetting in the C++ function won't be correct. Therefore I'll need a more sophisticated C++ function.


I believe the 'issue' I'm seeing comes down to what EACHI is doing. After re-reading Arun's answer again and again I think I understand it.

In particular, EACHI "evaluates j-expression on the matching rows for each row in Y". So it's taking the Y table (dt2 in my case) and evaluating it each row at a time. This can be seen if you use a few prints inside the Rcpp function

## I'm reducing the data sets so the 'prints' are readible
dt1 <- data.table(id = rep(1, 2),
                  seq1 = 1:2,
                  x = c(1:2),
                  y = c(1:2))

dt2 <- data.table(id = rep(1, 2),
                  seq2 = 7:8,
                  x = c(6:5),
                  y = c(6:5))

cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1,
                                          NumericVector x2, NumericVector y2){

  int n = x1.size();
  NumericVector distance(n);

  Rcpp::Rcout << "x1 size: " << x1.size() << std::endl;
  Rcpp::Rcout << "x2 size: " << x2.size() << std::endl;
  Rcpp::Rcout << "n size: " << n << std::endl;

  for(int i = 0; i < n; i++){
    distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2));
  }
  return distance;
}')

dt_cppJoin <- dt1[
  dt2,
  {print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
  on = "id",
  by = .EACHI
  ]
# [1] 6
# x1 size: 2
# x2 size: 1
# n size: 2
# [1] 5
# x1 size: 2
# x2 size: 1
# n size: 2

Notice in the output of the print statement that x2 is only length 1 each iteration. Whereas n is 2. So my x2[i] will obviously 'fail' when i gets to the second element (remembering that C++ arrays are 0-indexed)

Whereas taking out the EACHI has the same effect as the full-join approach

dt_cppJoin <- dt1[
  dt2,
  {print(i.x); (cppEucDistance(x, y, i.x, i.y)) },
  on = "id"
  ]
# [1] 6 6 5 5
# x1 size: 4
# x2 size: 4
# n size: 4

SymbolixAU
  • 25,502
  • 4
  • 67
  • 139