I have a DataFrame like this:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(7573, ['Apple', 'zucchini', 'eggplant', 'beer', 'spaghetti']),
     (2331, ['beer', 'eggplant', 'Strawberry', 'cherryberry', 'Chicken', 'Fish', 'Meat']),
     (5541, ['Cola', 'Pizza', 'Strawberry', 'cheese', 'Chicken', 'Yogurt', 'Meat'])],
    ('ID', 'Items'))
df1.show(truncate=False)
+----+--------------------------------------------------------------+
|ID |Items |
+----+--------------------------------------------------------------+
|7573|[Apple, zucchini, eggplant, beer, spaghetti] |
|2331|[beer, eggplant, Strawberry, cherryberry, Chicken, Fish, Meat]|
|5541|[Cola, Pizza, Strawberry, cheese, Chicken, Yogurt, Meat] |
+----+--------------------------------------------------------------+
I want to create a new DataFrame with one column per product, marking whether the customer purchased it: 1 if yes, 0 if not.
It should look like this:
+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+
| ID|Apple|Chicken|Cola|Fish|Meat|Pizza|Strawberry|Yogurt|beer|cheese|cherryberry|eggplant|spaghetti|zucchini|
+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+
|2331| 0| 1| 0| 1| 1| 0| 1| 0| 1| 0| 1| 1| 0| 0|
|5541| 0| 1| 1| 0| 1| 1| 1| 1| 0| 1| 0| 0| 0| 0|
|7573| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 1| 1| 1|
+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+
I was able to do this with the following code:
df1 = (df1.withColumn('exploded', F.explode('Items'))
          .groupBy("ID").pivot("exploded").agg(F.lit(1)).na.fill(0))
df1.show()
But my problem is that when I run it on a larger dataset, I get an error: The pivot column exploded has more than 10000 distinct values.
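Two workarounds I can think of are raising spark.sql.pivotMaxValues (the config the 10000 limit comes from), or computing the distinct items myself and passing them to pivot explicitly; as far as I understand, giving pivot an explicit value list skips the distinct-value check entirely. A minimal sketch of both ideas (the 100000 ceiling is only a placeholder):

# Option 1: raise the limit that triggers the error.
spark.conf.set('spark.sql.pivotMaxValues', 100000)

# Option 2: collect the distinct items once and pass them to pivot,
# so Spark never has to count them against the limit itself.
items = [r['item'] for r in
         df1.select(F.explode('Items').alias('item')).distinct().collect()]

df2 = (df1.withColumn('exploded', F.explode('Items'))
          .groupBy('ID')
          .pivot('exploded', items)
          .agg(F.lit(1))
          .na.fill(0))

Both still produce one column per distinct item, though, which seems like it could get unwieldy with a really large catalogue.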
How can I solve this?