
I am working in PySpark and have a transaction table imported as a Spark DataFrame, as follows:

User_ID    Date          Product_Name
-------    ----------    ------------
A          2019-11-30    Product 1
B          2019-10-20    Product 2
C          2019-10-01    Product 1
A          2019-12-01    Product 1

What I am trying to do is create a resulting table that, for each unique User_ID, determines whether that user has bought Product 1 more often than Product 2, and returns the string "Product 1" (or "Product 2" in the other case) in the second column of this new table.

I am finding this difficult to do in PySpark.
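For reproducibility, here is a minimal sketch that recreates the sample data above (assuming a SparkSession is already available as spark):

# Hypothetical reconstruction of the sample data shown above,
# assuming an existing SparkSession named `spark`.
df = spark.createDataFrame(
    [
        ("A", "2019-11-30", "Product 1"),
        ("B", "2019-10-20", "Product 2"),
        ("C", "2019-10-01", "Product 1"),
        ("A", "2019-12-01", "Product 1"),
    ],
    ["User_ID", "Date", "Product_Name"],
)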

Thomas Moore

1 Answer


Create a pivot table with columns for the counts of Product 1 and Product 2. Then the two columns can be compared.

from pyspark.sql import functions as F

# Pivot to one count column per product, fill missing counts with 0,
# then compare the two product columns.
df.groupBy("User_ID").pivot("Product_Name").agg(F.count("Product_Name")) \
    .fillna(0) \
    .withColumn("result", F.when(F.col("Product 1") > F.col("Product 2"), "Product 1").otherwise("Product 2")) \
    .show()

Output:

+-------+---------+---------+---------+
|User_ID|Product 1|Product 2|   result|
+-------+---------+---------+---------+
|      B|        0|        1|Product 2|
|      C|        1|        0|Product 1|
|      A|        2|        0|Product 1|
+-------+---------+---------+---------+
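Note that ties fall into the otherwise branch: a user who bought both products equally often gets "Product 2". If ties need different handling, add another F.when condition before the otherwise.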

For a more generic approach to calculating the result column (especially if there are more than two distinct products), please check this answer.
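As one possible sketch of such a generalization (assuming Spark 2.4+, where array_max supports struct elements, and assuming the pivoted counts are stored in a DataFrame named pivoted, a hypothetical name): build a (count, name) struct per product column and let array_max, which orders structs by their first field, pick the one with the highest count.

from pyspark.sql import functions as F

# `pivoted` is a hypothetical name for the DataFrame produced by the
# groupBy/pivot/fillna steps above, with one count column per product.
product_cols = [c for c in pivoted.columns if c != "User_ID"]

# array_max compares structs field by field, so the struct with the
# highest count wins; its "name" field is the winning product.
result = pivoted.withColumn(
    "result",
    F.array_max(F.array(*[
        F.struct(F.col(c).alias("cnt"), F.lit(c).alias("name"))
        for c in product_cols
    ]))["name"]
)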

werner