When trying to scale a column/feature in a Spark DataFrame, I first need to assemble the feature into a list/array. I'm using the R package sparklyr, but this should be the same in Scala or Python.
If I try without assembling the feature I want to scale, I get an error:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, mtcars, "mtcars")
tbl(sc, "mtcars") %>%
ft_standard_scaler(input_col = "wt", output_col = "wt_scaled")
Error: java.lang.IllegalArgumentException: requirement failed: Column wt must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.
at scala.Predef$.require(Predef.scala:224)
But if I first use ft_vector_assembler(), it does the job:
tbl(sc, "mtcars") %>%
ft_vector_assembler(input_col = "wt", output_col = "wt_temp") %>%
ft_standard_scaler(input_col = "wt_temp", output_col = "wt_scaled") %>%
select(wt, wt_scaled)
#> # Source: spark<?> [?? x 2]
#> wt wt_scaled
#> <dbl> <list>
#> 1 2.62 <dbl [1]>
#> 2 2.88 <dbl [1]>
#> 3 2.32 <dbl [1]>
#> 4 3.22 <dbl [1]>
#> 5 3.44 <dbl [1]>
#> 6 3.46 <dbl [1]>
#> 7 3.57 <dbl [1]>
#> 8 3.19 <dbl [1]>
#> 9 3.15 <dbl [1]>
#> 10 3.44 <dbl [1]>
#> # … with more rows
Created on 2019-08-16 by the reprex package (v0.3.0)
First of all, is there a reason why I have to assemble the feature? I realize it's needed when you have multiple features, but why is it required when there is only one?
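From the error message, my guess is that ft_standard_scaler() only accepts Spark's Vector type (the struct<type:tinyint,size:int,indices:array<int>,values:array<double>> in the message), so even a single double column has to be wrapped first. As a sanity check, the column types before and after assembling can be compared with sdf_schema(), which comes with sparklyr:

tbl(sc, "mtcars") %>%
  ft_vector_assembler(input_col = "wt", output_col = "wt_temp") %>%
  # wt should show as DoubleType, wt_temp as the assembled vector type
  sdf_schema()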
Second, if I want to inspect or plot the values of the scaled column, is there a way to unlist the new column in Spark?
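The closest thing I've found for the second point is sdf_separate_column(), which seems to split a vector column into scalar columns. A minimal sketch (the name wt_scaled_value is just my own label), though I don't know whether this is the idiomatic approach:

tbl(sc, "mtcars") %>%
  ft_vector_assembler(input_col = "wt", output_col = "wt_temp") %>%
  ft_standard_scaler(input_col = "wt_temp", output_col = "wt_scaled") %>%
  # pull the single value out of the length-1 vector into a plain double column
  sdf_separate_column("wt_scaled", into = "wt_scaled_value") %>%
  select(wt, wt_scaled_value)

That column could then presumably be collected and plotted like any other numeric column, but I'd like to know if there is a recommended way.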