I've got a big dataset I'm trying to process with dplyr on a distributed sparklyr tbl. Other functions I've tried inside mutate have worked fine, but base::grepl returns an error. The single-threaded process I want to replicate in Spark is:
library(dplyr)

df.dummy <- data.frame(name = c('100', '101', 'c102', '103', 'c104'), value = seq(1, 5))
df.dummy %>%
mutate(cat = grepl('c', name))
name value cat
1 100 1 FALSE
2 101 2 FALSE
3 c102 3 TRUE
4 103 4 FALSE
5 c104 5 TRUE
And the code I'm running to attempt the same thing in distributed processing:
library(sparklyr)

sdf.dummy <- copy_to(sc, df.dummy)
sdf.dummy %>%
mutate(cat = grepl('c', name))
Which yields the following error message:
Error : org.apache.spark.sql.AnalysisException: Undefined function: 'GREPL'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24
As grepl is a base function, I can't imagine this is a problem with it not being loaded on the worker nodes. I'm fairly new to Spark/sparklyr/dplyr, so please correct me if I've misunderstood any of the fundamentals of the process.
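For reference, one avenue I've come across but haven't verified: my understanding is that dplyr translates the mutate expression into Spark SQL rather than running R on the workers, and passes functions it doesn't recognise through verbatim (which would explain the "Undefined function: 'GREPL'" message). If that's right, a Spark/Hive SQL regex function might stand in for grepl. A sketch, assuming the function form of rlike is available on my Spark version:

```r
# Untested sketch: relies on dplyr passing rlike() through to Spark SQL,
# where (if available) it should behave like SQL's `name RLIKE 'c'`,
# i.e. TRUE when the regex 'c' matches the name column.
sdf.dummy %>%
  mutate(cat = rlike(name, 'c'))
```

I'd welcome confirmation on whether this pass-through behaviour is the intended way to handle functions that have no dplyr translation.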