
I have a data file with three columns, and I want to normalize the last column before applying ALS with Spark ML (Scala). How can I do it?

Here is how I currently build my DataFrame:

import org.apache.spark.sql.types.{FloatType, IntegerType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)

// Cast userId and productId to Int and view to Float, keeping the original column names
val viewdd = view_df.withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
                    .withColumnRenamed("userIdTemp", "userId")
                    .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
                    .withColumnRenamed("productIdTemp", "productId")
                    .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
                    .withColumnRenamed("viewTemp", "view")

1 Answer


Using StandardScaler is usually the way to go when any scaling/normalization is needed. However, in this case there is only a single column to scale, and it is a Float rather than a Vector. Since StandardScaler works only on Vector columns, a VectorAssembler would have to be applied first, and the resulting Vector converted back to a scalar after scaling, as sketched below.
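For reference, here is a minimal sketch of that route (the intermediate column names view_vec and view_vec_scaled and the result name are illustrative, not part of the question):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Wrap the single numeric column in a Vector so StandardScaler can consume it
val assembler = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("view_vec")

val scaler = new StandardScaler()
  .setInputCol("view_vec")
  .setOutputCol("view_vec_scaled")
  .setWithMean(true) // center on the mean as well as scaling to unit std
  .setWithStd(true)

val assembled = assembler.transform(viewdd)
val scaled = scaler.fit(assembled).transform(assembled)

// Unpack the one-element Vector back into a plain Double column
val firstElem = udf((v: Vector) => v(0))
val result = scaled
  .withColumn("view_scaled", firstElem(col("view_vec_scaled")))
  .drop("view_vec", "view_vec_scaled")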

The simpler way in this case would be to do it yourself. First get the mean and standard deviation of the column and then perform the scaling. It can be done on the view column as follows:

import org.apache.spark.sql.functions.{mean, stddev}
import spark.implicits._ // for the $-notation and .as[...]

// Compute both statistics in a single pass over the data
val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()
val viewdd_scaled = viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
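
To then train the recommender, a minimal follow-up sketch (assuming the ml ALS estimator; viewdd_scaled comes from the snippet above):

import org.apache.spark.ml.recommendation.ALS

// Illustrative only: train ALS using the standardized column as the rating
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("view_scaled")
val model = als.fit(viewdd_scaled)

Note that standardization produces negative values; if the view counts are implicit feedback, setImplicitPrefs(true) or a different normalization such as min-max scaling may be a better fit.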
  • How can I do the same in Python? (specifically the $"view" part) – Simon30 Jul 04 '19 at 14:20
  • @Simon30: You can take a look at: https://stackoverflow.com/questions/47624129/how-to-standardize-one-column-in-spark-using-standardscaler – Shaido Jul 05 '19 at 01:13