I have the following datasets:

User Table

+-------+---------+
|user_id|    value|
+-------+---------+
|  user1|[1, 2, 3]|
|  user2|[4, 5, 6]|
|  user3|[7, 8, 9]|
+-------+---------+

Items Table

+---------------+---------------+---------------+
|          item1|          item2|          item3|
+---------------+---------------+---------------+
|[0.5, 0.6, 0.7]|[0.2, 0.3, 0.4]|[0.1, 0.8, 0.9]|
+---------------+---------------+---------------+

I want to generate the following dataset by multiplying Users × Items:

+-------+-----+-----+-----+
|user_id|item1|item2|item3|
+-------+-----+-----+-----+
|  user1|  3.8|    2|  4.4|
|  user2|  9.2|  4.7|  9.8|
|  user3| 14.6|  7.4| 15.2|
+-------+-----+-----+-----+
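Each cell in the target table is the dot product of a user's vector with an item's vector; for example, user1 × item1 = 1·0.5 + 2·0.6 + 3·0.7 = 3.8. A plain-Python sanity check of the arithmetic (no Spark involved; the names here are mine, for illustration only):

```python
# Plain-Python check of the expected table (no Spark).
users = {"user1": [1, 2, 3], "user2": [4, 5, 6], "user3": [7, 8, 9]}
items = {"item1": [0.5, 0.6, 0.7], "item2": [0.2, 0.3, 0.4], "item3": [0.1, 0.8, 0.9]}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Round to one decimal to sidestep floating-point noise.
result = {u: {i: round(dot(uv, iv), 1) for i, iv in items.items()}
          for u, uv in users.items()}
```

This reproduces exactly the numbers in the table above.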

I was initially thinking of using a cross join to get all the fields together and then doing the multiplication row by row and column by column, but that seems like the wrong approach (and a very slow, memory-intensive process).

Is there a better approach I should use?

I'm using Scala and Spark 3.1

David R

2 Answers


I don't know if you can avoid the join that pairs users with items, but for the dot-product calculation you can use arrays_zip and aggregate.

PySpark 3.1+

from pyspark.sql import functions as F

cols = ['item1', 'item2', 'item3']

df = (df_user.join(df_item, how='outer')  # pair every user row with the single items row
      .select('user_id',
              # zip the user vector with each item vector into an array of structs
              *[F.arrays_zip(F.col('value'), c).alias(c) for c in cols])
      .select('user_id',
              # fold each zipped array into a dot product
              *[F.aggregate(c, F.lit(0.0), lambda acc, x: acc + x['value'] * x[c]).alias(c)
                for c in cols]))

Note the sums can show small floating-point rounding errors; this is standard IEEE 754 behavior, not a Spark or Python bug.
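For example, the classic case in plain Python (binary floats cannot represent 0.1 or 0.2 exactly, so sums pick up tiny errors):

```python
# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum is not exactly 0.3.
total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False
```

If exact equality matters downstream, round the result or compare within a tolerance.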

Emma

I was able to solve it using Emma's example, adapting it from PySpark to Scala.

Given the inputs (assuming `spark.implicits._` is in scope for `toDF`):

  val items = Seq(
    (Array(0.5, 0.6, 0.7), Array(0.2, 0.3, 0.4), Array(0.1, 0.8, 0.9))
  ).toDF("item1", "item2", "item3")

  val users = Seq(
    ("user1", Array(1, 2, 3)),
    ("user2", Array(4, 5, 6)),
    ("user3", Array(7, 8, 9))
  ).toDF("user_id", "value")

  val cols = Seq("item1", "item2", "item3")

I calculated the expected output as follows:

import org.apache.spark.sql.functions.{col, expr}

users.crossJoin(items)
  .select(col("user_id") +: cols.map { c =>
    // zip_with multiplies element-wise; aggregate folds the products into a sum
    expr(s"aggregate(zip_with(value, `${c}`, (x, y) -> x * y), 0D, (s, x) -> s + x)").as(c)
  }: _*)
  .show()

This outputs:

+-------+-----+-----+-----+
|user_id|item1|item2|item3|
+-------+-----+-----+-----+
|user1  |3.8  |2.0  |4.4  |
|user2  |9.2  |4.7  |9.8  |
|user3  |14.6 |7.4  |15.2 |
+-------+-----+-----+-----+
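In plain terms, `zip_with(value, item, (x, y) -> x * y)` multiplies the two arrays element-wise, and `aggregate(..., 0D, (s, x) -> s + x)` folds the products into a sum. The same computation for user1 × item1, sketched in plain Python (illustrative only, not the Spark API):

```python
from functools import reduce

value = [1, 2, 3]        # user1's vector
item1 = [0.5, 0.6, 0.7]  # item1's vector

# zip_with(value, item1, (x, y) -> x * y)
products = [x * y for x, y in zip(value, item1)]
# aggregate(products, 0D, (s, x) -> s + x)
total = reduce(lambda s, x: s + x, products, 0.0)
```

`total` comes out to 3.8 (up to floating-point rounding), matching the first cell of the output above.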

I have yet to measure the performance of this solution, but at least it works for now.

David R