Looking to substract every value in a row based on the value of a separate DF

Question

As the title states, I would like to subtract each value of a specific column by the mean of that column.

Here is my code attempt:

val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))


val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
  .withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()

score 2 · Accepted Answer · answered Apr 13 '21 at 01:03

You can either use Spark's DataFrame functions or a mere SQL query to a DataFrame to aggregate the values of the means for the columns you are focusing on (rating1, rating2).

val moviePairs = spark.createDataFrame(
  Seq(
        ("Moonlight", 7, 8),
        ("Lord Of The Drinks", 10, 1),
        ("The Disaster Artist", 3, 5),
        ("Airplane!", 7, 9),
        ("2001", 5, 1),
    )
).toDF("movie", "rating1", "rating2")

// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()

// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()

val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))

subMean.show()

Output for the test input DataFrame moviePairs (with the good ol' double precision loss which you can manage as seen here):

+-------------------+-------+-------+-------------------+-------------------+
|              movie|rating1|rating2|         meanDeltaX|         meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
|          Moonlight|      7|      8| 0.5999999999999996|                3.2|
| Lord Of The Drinks|     10|      1| 3.5999999999999996|               -3.8|
|The Disaster Artist|      3|      5|-3.4000000000000004|0.20000000000000018|
|          Airplane!|      7|      9| 0.5999999999999996|                4.2|
|               2001|      5|      1|-1.4000000000000004|               -3.8|
+-------------------+-------+-------+-------------------+-------------------+

Looking to substract every value in a row based on the value of a separate DF

1 Answers1