Why can't I use double colon operator with dplyr when the dataset is in sparklyr?

Question

A reproducible example (adapted from @forestfanjoe's answer):

library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")

df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))

df <- copy_to(sc, df, "payment")

> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30 
5     5          1.54 
6     6          0.936

fix_PaymentHistory <- function(df){df %>% dplyr::mutate(PaymentHistory = dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > 1,1, PaymentHistory)))}

df %>% fix_PaymentHistory

The error is:

 Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory >  : 
 object 'PaymentHistory' not found

I'm using the scope operator because I'm afraid that the name in dplyr will clash with some of the user-defined code. Note that PaymentHistory is a column variable in df.

The same error is not present when running the following code:

fix_PaymentHistory <- function(df){
    df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
      id PaymentHistory
 * <int>          <dbl>
 1     1         0     
 2     2         0     
 3     3         0     
 4     4         1     
 5     5         1     
 6     6         0.936 
 7     7         0     
 8     8         0.716 
 9     9         0     
10    10         0.0831
# ... with more rows

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Oct 24 '18 at 15:17
seems you were not using dplyr's `if_else` in the first place, are you currently using another package with an `if_else` function included? or did you define your own? what does F1 on if_else show you? — gaut, Oct 24 '18 at 15:18
The only package I'm using is dplyr and sparklyr. I'm certain that if_else is from dplyr. — Catiger3331, Oct 24 '18 at 17:22
So this just to confirm this only happens if you use a data source from `sparklyr`? If I use the same function with just a standard data.frame, I don't get an error (I don't have `library(sparklyr)` installed so I can't test with that. — MrFlick, Oct 24 '18 at 19:07
@MrFlick Yes, when I ran the same code with native data.frame, everything is fine. — Catiger3331, Oct 24 '18 at 19:40
@MrFlick MCVE without `sparklyr`: `library(magrittr); db <- dplyr::src_sqlite(":memory:", TRUE); dplyr::copy_to(db, mtcars); db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))` — zero323, Oct 24 '18 at 22:20

zero323 · Accepted Answer · 2018-10-24T22:18:31.117

TL;DR Because your code doesn't use dplyr::if_else at all.

sparklyr, when used as in the example, treats Spark as yet another database and issues queries using dbplyr SQL translation layer.

In this context if_else is no treated as a function, but an identifier which is converted to SQL primitives:

dbplyr::translate_sql(if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END

However if you pass a fully qualified named, it will circumvent this mechanism, try to evaluate the function, and ultimately fail, because the database columns are not in the scope.

I'm afraid that the name in dplyr will clash with some of the user-defined code.

As you see, there is no need for dplyr to be in scope here at all - functions called in sparklyr pipelines are either translated to corresponding SQL constructs, or if there is no specific translation rule in place, passed as-is and resolved by Spark SQL engine (this path is used to invoke Spark functions).

Of course this mechanism is not specific to sparklyr and you're likely to see the same behavior using other table backed by a database:

library(magrittr)

db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)

db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))

Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found

Why can't I use double colon operator with dplyr when the dataset is in sparklyr?

1 Answers1

Linked