
When trying to scale a column/feature in a Spark DataFrame, I need to first assemble the feature into a list/array. I'm using the R package sparklyr, but this should be the same in Scala or Python.

If I try without assembling the feature I'm trying to scale, I get:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

copy_to(sc, mtcars, "mtcars")

tbl(sc, "mtcars") %>% 
   ft_standard_scaler(input_col = "wt", output_col = "wt_scaled")

Error: java.lang.IllegalArgumentException: requirement failed: Column wt must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.
        at scala.Predef$.require(Predef.scala:224)

But if I use ft_vector_assembler() first, it does the job.

tbl(sc, "mtcars") %>% 
    ft_vector_assembler(input_col = "wt", output_col = "wt_temp") %>% 
    ft_standard_scaler(input_col = "wt_temp", output_col = "wt_scaled") %>% 
    select(wt, wt_scaled)
#> # Source: spark<?> [?? x 2]
#>       wt wt_scaled
#>    <dbl> <list>   
#>  1  2.62 <dbl [1]>
#>  2  2.88 <dbl [1]>
#>  3  2.32 <dbl [1]>
#>  4  3.22 <dbl [1]>
#>  5  3.44 <dbl [1]>
#>  6  3.46 <dbl [1]>
#>  7  3.57 <dbl [1]>
#>  8  3.19 <dbl [1]>
#>  9  3.15 <dbl [1]>
#> 10  3.44 <dbl [1]>
#> # … with more rows

Created on 2019-08-16 by the reprex package (v0.3.0)

First of all, is there a reason why I have to assemble the feature? I realize that it's needed when you have multiple features, but why do you have to do it if you only have one?

Second, if I want to inspect or plot the values of the scaled column, is there a way to unlist the new column in Spark?

FilipW
  • Spark gives us a number of options to do transformations to standardize/normalize the data. I’ll go through the StandardScaler, Normalizer and the MinMaxScaler here. To implement any of the aforementioned transformations, we need to assemble the features into a feature vector. Any guesses on what we can use here? Correct, Vector Assemblers! – thebluephantom Aug 16 '19 at 13:43
  • But does that make sense if you only want to scale one feature? – FilipW Aug 16 '19 at 14:27
  • I assume so in fact – thebluephantom Aug 16 '19 at 14:28
  • See https://stackoverflow.com/questions/57522124/why-do-i-need-to-assemble-vector-before-scaling-in-spark?noredirect=1#comment101519224_57522124 – thebluephantom Aug 16 '19 at 14:28

1 Answer


You should look at it from an engineering perspective. If types other than vectors were accepted, someone would have to write code to handle those types and cast them in certain scenarios. The performance-optimization parts of Spark in particular would have to cover such scenarios (check this answer for why vectors are beneficial in general).

That would force every developer of a machine learning algorithm for Spark to implement a lot of code to cover plenty of different scenarios. When you pull all of that code together (and keep it out of the machine learning algorithms like StandardScaler), you get something like the current VectorAssembler. This keeps the code of StandardScaler and the other algorithms cleaner, since they only have to handle vectors.

Of course this requires you to call the vector assembler even if you only have one feature column, but it keeps the code of Spark itself much cleaner.
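
For illustration, here is a hedged sketch of the multi-feature case the assembler is really designed for, reusing the mtcars table from the question (the column choices wt, hp and disp are arbitrary, and current sparklyr documents the argument as input_cols):

tbl(sc, "mtcars") %>% 
    # combine several numeric columns into one vector column for the scaler
    ft_vector_assembler(input_cols = c("wt", "hp", "disp"), output_col = "features") %>% 
    ft_standard_scaler(input_col = "features", output_col = "features_scaled")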

Regarding your other question: you can disassemble vectors with a UDF in PySpark (check this answer for a PySpark example), but I don't know how to do it in R.
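
One possibility in R, sketched under the assumption that sparklyr's sdf_separate_column() handles the ML vector column the same way it handles list columns (untested here, and the name wt_scaled_value is just illustrative):

tbl(sc, "mtcars") %>% 
    ft_vector_assembler(input_cols = "wt", output_col = "wt_temp") %>% 
    ft_standard_scaler(input_col = "wt_temp", output_col = "wt_scaled") %>% 
    # hedged: split the one-element vector into a scalar column named by `into`
    sdf_separate_column("wt_scaled", into = "wt_scaled_value") %>% 
    select(wt, wt_scaled_value)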

cronoik