
I have the same problem as asked here, but I need a solution in PySpark and without Breeze.

For example, if my PySpark DataFrame looks like this:

user    |  weight  |  vec
"u1"    | 0.1      | [2, 4, 6]
"u1"    | 0.5      | [4, 8, 12]
"u2"    | 0.5      | [20, 40, 60]

where the weight column has type double and the vec column has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a DataFrame that looks like this:

user    |  wsum
"u1"    | [2.2, 4.4, 6.6]
"u2"    | [10, 20, 30]

To do this I have tried the following:

df.groupBy('user').agg(F.sum(df.vec * df.weight).alias("wsum"))

But it fails because the vec and weight columns have different types.

How can I compute this weighted sum without Breeze?

– Urian

1 Answer


One way is to use the higher-order function transform, available from Spark 2.4:

from pyspark.sql.functions import array, col, expr, size, sum as sum_

# get the length of the vec arrays (assumes all rows have the same length)
n = df.select(size("vec")).first()[0]

# multiply each element of the vec array by the row's weight
transform_expr = "transform(vec, x -> x * weight)"

df.withColumn("weighted_vec", expr(transform_expr)) \
  .groupBy("user") \
  .agg(array(*[sum_(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()

Gives:

+----+------------------+
|user|              wsum|
+----+------------------+
|  u1|   [2.2, 4.4, 6.6]|
|  u2|[10.0, 20.0, 30.0]|
+----+------------------+
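
Note that summing position by position (sum of weighted_vec[i] for each i) assumes every vec array has the same length n. Since transform is a Spark SQL higher-order function, the multiplication runs inside the JVM, so no Python UDF (and its serialization overhead) is needed.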

For Spark < 2.4, you can use a list comprehension to multiply each element by the weight column, like this:

df.withColumn("weighted_vec", array(*[col("vec")[i] * col("weight") for i in range(n)])) \
  .groupBy("user").agg(array(*[sum(col("weighted_vec")[i]) for i in range(n)]).alias("wsum")) \
  .show()
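
This gives the same result as the Spark 2.4 version; it reuses n computed above, so the vec arrays again need a fixed, known length.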
– blackbishop
  • Thanks for the prompt answer. It seems to do something, but I am currently running out of memory with a java.lang.OutOfMemoryError. Not sure, but it might be related to a serialization issue as discussed in the first answer [here](https://stackoverflow.com/questions/36140493/java-lang-outofmemoryerror-in-pyspark) – Urian Jan 08 '20 at 16:17