I'm new working with spark. I would like to multiply a large number of columns of a spark dataframe by values in a vector. So far with mtcars I used a for loop and mutate_at as follows:
library(dplyr)
library(rlang)
library(sparklyr)
sc1 <- spark_connect(master = "local")
mtcars_sp = sdf_copy_to(sc1, mtcars, overwrite = TRUE)
mtcars_cols = colnames(mtcars_sp)
mtc_factors = 0:10 / 10
# mutate 1 col at a time
for (i in 1:length(mtcars_cols)) {
# set equation and print - use sym() convert a string
mtcars_eq = quo( UQ(sym(mtcars_cols[i])) * mtc_factors[i])
# mutate formula - LHS resolves to a string, RHS a quosure
mtcars_sp = mtcars_sp %>%
mutate(!!mtcars_cols[i] := !!mtcars_eq )
}
dbplyr::sql_render(mtcars_sp)
mtcars_sp
This works ok with mtcars. However, it results in nested SQL queries being sent to spark, as shown by the sql_render, and breaks down with many columns. Can dplyr be used to instead send a single SQL query in this case?
BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!