
I've got a big dataset that I'm trying to process using dplyr on a distributed sparklyr tbl. The other functions I have tried inside mutate have worked so far, but base::grepl returns an error. The single-threaded process that I want to replicate using Spark is:

df.dummy <- data.frame(name = c('100', '101', 'c102', '103', 'c104'), value = seq(1,5))

df.dummy %>% 
   mutate(cat = grepl('c', name))

  name value   cat
1  100     1 FALSE
2  101     2 FALSE
3 c102     3  TRUE
4  103     4 FALSE
5 c104     5  TRUE

And the code I'm trying to run to get it to work in distributed processing:

sdf.dummy <- copy_to(sc, df.dummy)

sdf.dummy %>% 
   mutate(cat = grepl('c', name))

Which yields the following error message:

Error : org.apache.spark.sql.AnalysisException: Undefined function: 'GREPL'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24

As grepl is a base function, I can't imagine that the problem is that it isn't loaded on the worker nodes. I'm fairly new to Spark/sparklyr/dplyr, so please correct me if I've misunderstood any of the fundamentals of the process.
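
A note on the error above: the message comes from Spark SQL rather than from R. That suggests dplyr translated the mutate call into SQL and passed grepl through as an unknown function named GREPL, instead of running R code on the worker nodes. Below is a minimal sketch of one possible workaround. It assumes a local Spark installation, and it assumes that Spark's built-in rlike can be called in function form. rlike is a Spark SQL function, not an R one, so dplyr should pass it through untranslated.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; replace with your own cluster settings.
sc <- spark_connect(master = "local")

df.dummy <- data.frame(name = c('100', '101', 'c102', '103', 'c104'), value = seq(1, 5))
sdf.dummy <- copy_to(sc, df.dummy, overwrite = TRUE)

# rlike is not defined in R; the expression is sent to Spark SQL as-is,
# where rlike(str, regexp) tests whether str matches the regular expression.
query <- sdf.dummy %>%
  mutate(cat = rlike(name, 'c'))

# Inspect the SQL that dplyr generated before anything is executed.
show_query(query)

# Pull the result back into R, ordered by value so the row order is deterministic.
res <- query %>%
  arrange(value) %>%
  collect()
print(res)

spark_disconnect(sc)
```

If the pass-through works as assumed, the cat column should match the single-threaded grepl result. Calling show_query first is a cheap way to check how an R expression was translated before Spark runs it.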

Tom