1

I'm trying to write a function that performs and sdf_pivot() a creates a Spark DataFrame with column names that includes the name of the original variable or column.

set.seed(80)
df <- data.frame(id = c(1:5),
                 var1 = sample(LETTERS[1:12], 5, replace = TRUE),
                 var2 = sample(LETTERS[13:16], 5, replace = TRUE))

ref <- copy_to(sc, df, "mytbl")
glimpse(ref)

Observations: 5
Variables: 3
$ id   <int> 1, 2, 3, 4, 5
$ var1 <chr> "F", "G", "J", "A", "H"
$ var2 <chr> "M", "O", "O", "O", "O"

This the expected result for var1 without writing the function:

ref %>% 
  dplyr::select(id, var1) %>%
  dplyr::mutate(newvar1 = paste0("var1_",var1)) %>%
  sparklyr::sdf_pivot(formula = id ~ newvar1, fun.aggregate = "count") %>% 
  sparklyr::na.replace(0)

# Source:   table<sparklyr_tmp_56f96ab7d507> [?? x 6]
# Database: spark_connection
     id var1_A var1_F var1_G var1_H var1_J
  <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      0      1      0      0      0
2     3      0      0      0      0      1
3     5      0      0      0      1      0
4     4      1      0      0      0      0
5     2      0      0      1      0      0

Below one of my function versions, which of course doesn't work, I've also tried with quote and deparse, but I'm stuck with both mutate_ and sdf_pivot.

myPivotFunction <- function(sdf, varname, newvarname){

  mutate_op <- paste0(newvarname," = ", "var1_", varname)

  sdf %>% 
    dplyr::select_(.dots = list('id', varname)) %>%
    mutate_(.dots = setNames(newvarname, mutate_op)) %>%
    sparklyr::sdf_pivot(formula = id ~ newvar1, fun.aggregate = "count") %>% 
    sparklyr::na.replace(0)
}
M--
  • 25,431
  • 8
  • 61
  • 93
guzu92
  • 737
  • 1
  • 12
  • 28

1 Answers1

0

A little bit of rlang should do the trick:

library(rlang)
library(glue)

myPivotFunction <- function(sdf, varname, newvarname){
  exprs <- c("id", glue('paste0("var1_", {varname})')) %>% 
    setNames(c("id", newvarname)) %>% 
    lapply(parse_quosure)

  sdf %>%
    transmute(!!! exprs) %>%
    sdf_pivot(
      formula = as.formula(glue("id ~ {newvarname}")), 
      fun.aggregate = "count") %>%
    na.replace(0)
}
zero323
  • 322,348
  • 103
  • 959
  • 935
  • Works great! thanks!, do you know if there is a ref document explaining SE and SNE programming with dplyr ? the select_, filter_, summarise_, etc.. methods ? – guzu92 Feb 15 '18 at 12:37
  • With regards to the hardcoded `"var1_"` inside `glue()`, I can build the desired string, say `str1`, with `varname` arg and do a `glue(str1)`; I can also get rid of `newvarname` as argument of the function, just replacing it by `"tmpvar"` inside `setNames()` and by `tmpvar` inside `as.formula(glue(...))` – guzu92 Feb 15 '18 at 17:47
  • TBH I didn't understand what is the point of `var1_` part in the first place. It looks more like a job for rename. – zero323 Feb 15 '18 at 19:00
  • Is is possible to do this, but with the fun.aggregate being a "sum"? This might be a new question... – schnee May 21 '18 at 19:51