
I am working in PySpark and have a transaction table imported as a Spark DataFrame, as follows:

User_ID    Date          Product_Name
-------    ----------    ------------
A          2019-11-30    Product 1
B          2019-10-20    Product 2
C          2019-10-01    Product 1
A          2019-12-01    Product 1

What I am trying to do is create a resulting table that, for each unique User_ID, determines whether that user has bought Product 1 more often than Product 2, and returns the string "Product 1" (or "Product 2" in the other case) in the second column of this new table.

I am finding this difficult to do in PySpark.
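For reproducibility, here is a minimal sketch that recreates the sample data above (assuming a SparkSession is already available as spark):

# Hypothetical reconstruction of the sample data shown above,
# assuming an existing SparkSession named `spark`.
df = spark.createDataFrame(
    [
        ("A", "2019-11-30", "Product 1"),
        ("B", "2019-10-20", "Product 2"),
        ("C", "2019-10-01", "Product 1"),
        ("A", "2019-12-01", "Product 1"),
    ],
    ["User_ID", "Date", "Product_Name"],
)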

Thomas Moore

1 Answer


Create a pivot table with columns for the counts of Product 1 and Product 2. Then the two columns can be compared.

from pyspark.sql import functions as F

# Pivot to one count column per product, fill missing counts with 0,
# then compare the two product columns.
df.groupBy("User_ID").pivot("Product_Name").agg(F.count("Product_Name")) \
    .fillna(0) \
    .withColumn("result", F.when(F.col("Product 1") > F.col("Product 2"), "Product 1").otherwise("Product 2")) \
    .show()

Output:

+-------+---------+---------+---------+
|User_ID|Product 1|Product 2|   result|
+-------+---------+---------+---------+
|      B|        0|        1|Product 2|
|      C|        1|        0|Product 1|
|      A|        2|        0|Product 1|
+-------+---------+---------+---------+
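Note that ties fall into the otherwise branch: a user who bought both products equally often gets "Product 2". If ties need different handling, add another F.when condition before the otherwise.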

For a more generic approach to calculating the result column (especially if there are more than two distinct products), please check this answer.
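As one possible sketch of such a generalization (assuming Spark 2.4+, where array_max supports struct elements, and assuming the pivoted counts are stored in a DataFrame named pivoted, a hypothetical name): build a (count, name) struct per product column and let array_max, which orders structs by their first field, pick the one with the highest count.

from pyspark.sql import functions as F

# `pivoted` is a hypothetical name for the DataFrame produced by the
# groupBy/pivot/fillna steps above, with one count column per product.
product_cols = [c for c in pivoted.columns if c != "User_ID"]

# array_max compares structs field by field, so the struct with the
# highest count wins; its "name" field is the winning product.
result = pivoted.withColumn(
    "result",
    F.array_max(F.array(*[
        F.struct(F.col(c).alias("cnt"), F.lit(c).alias("name"))
        for c in product_cols
    ]))["name"]
)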

werner