
I have a data file with three columns, and I want to normalize the last column before applying ALS with Spark ML (Scala). How can I do it?

Here is how I currently build my DataFrame:

import org.apache.spark.sql.types.{FloatType, IntegerType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)

// Cast userId and productId to Int and view to Float, keeping the original column names
val viewdd = view_df.withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
                    .withColumnRenamed("userIdTemp", "userId")
                    .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
                    .withColumnRenamed("productIdTemp", "productId")
                    .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
                    .withColumnRenamed("viewTemp", "view")

1 Answer


Using StandardScaler is usually the way to go when any scaling/normalization is needed. However, in this case there is only a single column to scale, and it is a Float rather than a Vector. Since StandardScaler works only on Vector columns, a VectorAssembler would have to be applied first, and the resulting Vector converted back to a scalar after scaling, as sketched below.
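For reference, here is a minimal sketch of that route (the intermediate column names view_vec and view_vec_scaled and the result name are illustrative, not part of the question):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Wrap the single numeric column in a Vector so StandardScaler can consume it
val assembler = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("view_vec")

val scaler = new StandardScaler()
  .setInputCol("view_vec")
  .setOutputCol("view_vec_scaled")
  .setWithMean(true) // center on the mean as well as scaling to unit std
  .setWithStd(true)

val assembled = assembler.transform(viewdd)
val scaled = scaler.fit(assembled).transform(assembled)

// Unpack the one-element Vector back into a plain Double column
val firstElem = udf((v: Vector) => v(0))
val result = scaled
  .withColumn("view_scaled", firstElem(col("view_vec_scaled")))
  .drop("view_vec", "view_vec_scaled")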

The simpler way in this case would be to do it yourself. First get the mean and standard deviation of the column and then perform the scaling. It can be done on the view column as follows:

import org.apache.spark.sql.functions.{mean, stddev}
import spark.implicits._ // for the $-notation and .as[...]

// Compute both statistics in a single pass over the data
val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()
val viewdd_scaled = viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
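
To then train the recommender, a minimal follow-up sketch (assuming the ml ALS estimator; viewdd_scaled comes from the snippet above):

import org.apache.spark.ml.recommendation.ALS

// Illustrative only: train ALS using the standardized column as the rating
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("view_scaled")
val model = als.fit(viewdd_scaled)

Note that standardization produces negative values; if the view counts are implicit feedback, setImplicitPrefs(true) or a different normalization such as min-max scaling may be a better fit.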
  • How can I do the same in Python? (specifically the $"view" part) – Simon30 Jul 04 '19 at 14:20
  • @Simon30: You can take a look at: https://stackoverflow.com/questions/47624129/how-to-standardize-one-column-in-spark-using-standardscaler – Shaido Jul 05 '19 at 01:13