I tried to use the pipeline approach to build an ML routine in sparklyr.

Apparently the ft_dplyr_transformer does not support table pivoting, since this part:

%>% sdf_pivot(formula = Who + time_period ~ What_Action, fun.aggregate = "count") %>%
              na.replace(0)

causes the whole pipeline to fail. If I omit it, the rest works just fine.

Is this really the case, or am I missing some of the basics of pipelining?

spark_ml_pipeline <- 
     ml_pipeline(r_spark_connection) %>%
     ft_dplyr_transformer(
          preprocessed_spark_df %>%
          mutate(
               time_period = if(Date < '2017-12-01') {
                    'train_period'
               } else {
                    'test_period'
               }
          ) %>%
          mutate(
               What_Action = translate(What_Action, ' ', '_')
          ) %>%
          filter(
               !ObjectType %in% c('logon')
          ) %>%
          sdf_pivot(formula = Who + time_period ~ What_Action, fun.aggregate = "count") %>%
          na.replace(0)
     )

The help for the function reads:

Implements the transformations which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM THIS ...' where 'THIS' represents the underlying table of the input dataset.

Vadim Kotov
Alexey Burnakov
  • This sounds about right. There is no pivoting operator in SQL. It belongs to the `DataFrame` API. You can do it by hand (like [this](https://stackoverflow.com/a/30397833/6910411)). – zero323 Feb 15 '18 at 00:42
  • "There is no pivoting operator in SQL": are you referring to the spark.sql library https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$? In general, SQL does have pivot/unpivot functions. OK, thank you for the manual method. I think it will be helpful sometimes. – Alexey Burnakov Feb 15 '18 at 09:48
  • `ft_dplyr_transformer()` only supports operations that can be translated to SQL, and as @user6910411 mentioned, `sdf_pivot()` calls the `pivot` method of `DataFrame` and hence can't be part of a pipeline. – kevinykuo Feb 18 '18 at 03:44
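
The "by hand" pivot mentioned in the comments could look roughly like the sketch below: one conditional sum per action, which dplyr can translate to plain SQL. It assumes `r_spark_connection` and `preprocessed_spark_df` from the question. The action names `file_open` and `file_delete` are hypothetical placeholders; the real distinct values of `What_Action` have to be listed explicitly. It also assumes the resulting `GROUP BY` statement is accepted by the underlying SQL transformer, which I have not verified against this data.

```r
library(sparklyr)
library(dplyr)

# Assumes `r_spark_connection` and `preprocessed_spark_df` from the question.
# `file_open` and `file_delete` are hypothetical action names; replace them
# with the real values of What_Action (after spaces become underscores).
manual_pivot <- preprocessed_spark_df %>%
  mutate(
    time_period = ifelse(Date < '2017-12-01', 'train_period', 'test_period'),
    What_Action = translate(What_Action, ' ', '_')
  ) %>%
  filter(!ObjectType %in% c('logon')) %>%
  group_by(Who, time_period) %>%
  summarise(
    # One conditional sum per action stands in for sdf_pivot(..., "count").
    file_open   = sum(ifelse(What_Action == 'file_open', 1, 0), na.rm = TRUE),
    file_delete = sum(ifelse(What_Action == 'file_delete', 1, 0), na.rm = TRUE)
  ) %>%
  ungroup()

# Only SQL-translatable verbs are used, so this can be passed to the transformer.
spark_ml_pipeline <-
  ml_pipeline(r_spark_connection) %>%
  ft_dplyr_transformer(manual_pivot)
```

Since a group with no matching rows sums to 0 rather than NA, the `na.replace(0)` step should not be needed here. The trade-off is that the set of pivoted columns is fixed when the pipeline is defined, whereas `sdf_pivot()` discovers it from the data.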

0 Answers