
Assume I have a PySpark DataFrame as shown below. Each row records one item that a user bought on a specific date.

+--+-------------+-----------+
|ID|  Item Bought| Date      |
+--+-------------+-----------+
|1 |  Laptop     | 01/01/2018|  
|1 |  Laptop     | 12/01/2017|  
|1 |  Car        | 01/12/2018|  
|2 |  Cake       | 02/01/2018|  
|3 |  TV         | 11/02/2017| 
+--+-------------+-----------+

Now I would like to create a new DataFrame as shown below.

+---+--------+-----+------+----+
|ID | Laptop | Car | Cake | TV |
+---+--------+-----+------+----+
|1  | 2      | 1   | 0    | 0  | 
|2  | 0      | 0   | 1    | 0  |
|3  | 0      | 0   | 0    | 1  |
+---+--------+-----+------+----+

There is one column per item, and for each user the value in each item column is the number of times that user bought the item.


2 Answers


If you have the data in PySpark as a DataFrame like this:

df = sc.parallelize([(1, 'laptop', '01/01/2018'),
                     (1, 'laptop', '12/01/2017'),
                     (1, 'car', '01/12/2018'),
                     (2, 'cake', '02/01/2018'),
                     (3, 'tv', '11/02/2017')]).toDF(['id', 'item bought', 'date'])

Now you can use groupby and pivot to get the result:

df2 = (df.groupby('id')
         .pivot('item bought', ['tv', 'cake', 'laptop', 'car'])
         .count()
         .fillna(0))
df2.show()

Result:

+---+---+----+------+---+
| id| tv|cake|laptop|car|
+---+---+----+------+---+
|  1|  0|   0|     2|  1|
|  3|  1|   0|     0|  0|
|  2|  0|   1|     0|  0|
+---+---+----+------+---+

Note that supplying the distinct values to pivot is optional, but providing them up front speeds things up, because otherwise Spark has to run an extra job first to compute them itself.
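
For example, the same pivot can be written without the explicit value list (a minimal sketch using the df defined above); Spark then scans the column for the distinct values before pivoting:

# Same result, but Spark must first discover the distinct
# values of 'item bought' before it can pivot.
df3 = df.groupby('id').pivot('item bought').count().fillna(0)
df3.show()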


Another solution:

import pyspark.sql.functions as F

df = sc.parallelize([
    (1, 'Laptop', '01/01/2018'), (1, 'Laptop', '12/01/2017'),
    (1, 'Car', '01/12/2018'), (2, 'Cake', '02/01/2018'),
    (3, 'TV', '11/02/2017')]).toDF(['ID', 'Item', 'Date'])

# Collect the distinct items to use as column names
items = sorted(df.select("Item").distinct().rdd
               .map(lambda row: row[0])
               .collect())

# One column per item: the item name where the row matches, null otherwise
cols = [F.when(F.col("Item") == m, F.col("Item")).otherwise(None).alias(m) for m in items]
# F.count skips nulls, so this counts how often each item occurs per ID
counts = [F.count(F.col(m)).alias(m) for m in items]

df_reshaped = df.select(F.col("ID"), *cols)\
                .groupBy("ID")\
                .agg(*counts)
df_reshaped.show()
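
Assuming the sample data above, this should produce the same counts as the pivot approach, with the item columns in alphabetical order (row order may vary):

+---+----+---+------+---+
| ID|Cake|Car|Laptop| TV|
+---+----+---+------+---+
|  1|   0|  1|     2|  0|
|  2|   1|  0|     0|  0|
|  3|   0|  0|     0|  1|
+---+----+---+------+---+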