I'm new to working with Spark. I would like to multiply a large number of columns of a Spark DataFrame by the values in a vector. So far, with mtcars, I have used a for loop and mutate as follows:

library(dplyr)
library(rlang)
library(sparklyr)

sc1 <- spark_connect(master = "local")

mtcars_sp = sdf_copy_to(sc1, mtcars, overwrite = TRUE)

mtcars_cols = colnames(mtcars_sp)
mtc_factors = 0:10 / 10

# mutate one column at a time
for (i in seq_along(mtcars_cols)) {
    # build the expression - sym() converts the column name string to a symbol
    mtcars_eq = quo(!!sym(mtcars_cols[i]) * mtc_factors[i])
    # mutate - LHS is unquoted from a string, RHS from a quosure
    mtcars_sp = mtcars_sp %>% 
        mutate(!!mtcars_cols[i] := !!mtcars_eq)
}

dbplyr::sql_render(mtcars_sp)
mtcars_sp

This works OK with mtcars. However, it sends nested SQL queries to Spark, as sql_render shows, and it breaks down when there are many columns. Can dplyr be used to send a single SQL query instead in this case?

BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!


1 Answer

In general, you can apply the approach from this great answer by Artem Sokolov:

library(glue)

# build one "col * factor" expression per column and splice them
# all into a single mutate call (parse_quosure is deprecated in
# recent rlang; parse_expr is its replacement)
mtcars_sp %>% 
  mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
    lapply(rlang::parse_expr))
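
As a quick check, you can reuse the sql_render call from the question; with all of the expressions spliced into one mutate, the rendered SQL is a single flat SELECT rather than a chain of nested subqueries (mtcars_scaled is just an illustrative name):

# build the expressions up front, splice them in, and render the SQL
mtcars_scaled <- mtcars_sp %>% 
  mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
    lapply(rlang::parse_expr))

dbplyr::sql_render(mtcars_scaled)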

However, if this is input for MLlib algorithms, then ft_vector_assembler combined with ft_elementwise_product might be a better fit:

scaled <- mtcars_sp %>% 
  ft_vector_assembler(mtcars_cols, "features") %>% 
  ft_elementwise_product("features", "features_scaled", mtc_factors)
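
To get a feel for what the two transformers produce, you can pull a few rows back to R (a minimal inspection sketch using standard dplyr/sparklyr verbs; the ML vector columns come back as list columns):

# peek at the assembled vector and its elementwise-scaled counterpart
scaled %>%
  select(features, features_scaled) %>%
  head(3) %>%
  collect()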

The result can be separated into individual columns with sdf_separate_column (I wouldn't recommend that if you're going to feed the data to MLlib):

scaled %>% 
  select(features_scaled) %>% 
  sdf_separate_column("features_scaled", mtcars_cols)
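
For example, to end up with an ordinary R data frame of just the scaled values (a small usage sketch building on the snippet above):

# separate, drop the intermediate vector column, and collect a few rows
scaled %>%
  select(features_scaled) %>%
  sdf_separate_column("features_scaled", mtcars_cols) %>%
  select(-features_scaled) %>%
  head(5) %>%
  collect()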
– zero323
  • Thanks for that! I'm not using the MLlib algorithms for now, so the first approach is perfect for me. – swany Dec 11 '17 at 14:23