I tried to use the pipeline approach when building an ML routine in sparklyr.
Apparently the ft_dplyr_transformer
does not support table pivoting, since this part:
%>% sdf_pivot(formula = Who + time_period ~ What_Action, fun.aggregate = "count") %>%
na.replace(0)
crashes the overall pipeline. If I omit it, the rest works just fine.
Is this really the case, or am I missing some of the basics of pipelining?
spark_ml_pipeline <-
  ml_pipeline(r_spark_connection) %>%
  ft_dplyr_transformer(
    preprocessed_spark_df %>%
      mutate(
        # translated to CASE WHEN by dbplyr
        time_period = ifelse(Date < '2017-12-01', 'train_period', 'test_period')
      ) %>%
      mutate(
        # Hive translate(): replace spaces with underscores
        What_Action = translate(What_Action, ' ', '_')
      ) %>%
      filter(
        !ObjectType %in% c('logon')
      ) %>%
      sdf_pivot(formula = Who + time_period ~ What_Action, fun.aggregate = "count") %>%
      na.replace(0)
  )
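For what it's worth, one workaround consistent with the help text's restriction is to keep only the SQL-translatable dplyr verbs inside ft_dplyr_transformer and run the sparklyr-specific steps (sdf_pivot(), na.replace()) eagerly, outside the pipeline. A sketch under that assumption, reusing the connection and table names from the question (I'm assuming ml_fit_and_transform as the way to materialise the stage's output):

```r
library(sparklyr)
library(dplyr)

# Pipeline stage: only dplyr verbs that dbplyr can translate to SQL.
spark_ml_pipeline <-
  ml_pipeline(r_spark_connection) %>%
  ft_dplyr_transformer(
    preprocessed_spark_df %>%
      mutate(time_period = ifelse(Date < '2017-12-01', 'train_period', 'test_period')) %>%
      mutate(What_Action = translate(What_Action, ' ', '_')) %>%
      filter(!ObjectType %in% c('logon'))
  )

# Eager steps: sdf_pivot() and na.replace() act on the Spark DataFrame
# directly rather than through SQL translation, so they stay outside
# the pipeline and run immediately on the transformed data.
pivoted_df <-
  ml_fit_and_transform(spark_ml_pipeline, preprocessed_spark_df) %>%
  sdf_pivot(formula = Who + time_period ~ What_Action, fun.aggregate = "count") %>%
  na.replace(0)
```

The obvious downside is that the pivot is then not part of the saved pipeline, so it has to be reapplied to any new data by hand.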
The help for the function reads:
Implements the transformations which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM THIS ...' where 'THIS' represents the underlying table of the input dataset.
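That restriction would explain the behaviour: ft_dplyr_transformer works by letting dbplyr render the dplyr chain as a single SQL query and wrapping it in Spark's SQLTransformer, so only verbs that have a SQL translation can be captured. sdf_pivot() and na.replace() call the Spark DataFrame API directly and have no SQL equivalent, so they cannot be baked into such a stage. For comparison, the dplyr part of the stage could be written by hand with ft_sql_transformer, where the __THIS__ placeholder stands for the input dataset (a sketch, column names taken from the question):

```r
# Hand-written equivalent of the dplyr stage; __THIS__ is substituted
# with the input dataset when the pipeline runs (Spark SQLTransformer).
spark_ml_pipeline <-
  ml_pipeline(r_spark_connection) %>%
  ft_sql_transformer(
    "SELECT *,
            CASE WHEN Date < '2017-12-01' THEN 'train_period'
                 ELSE 'test_period' END AS time_period,
            translate(What_Action, ' ', '_') AS What_Action_clean
     FROM __THIS__
     WHERE ObjectType != 'logon'"
  )
```

Note there is no SELECT-level syntax for a dynamic pivot here either, which is consistent with the pivot step failing inside ft_dplyr_transformer.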