
I have a DataFrame like this:

df1 = spark.createDataFrame(
    [(7573, ['Apple', 'zucchini', 'eggplant', 'beer', 'spaghetti']),
     (2331, ['beer', 'eggplant', 'Strawberry', 'cherryberry', 'Chicken', 'Fish', 'Meat']),
     (5541, ['Cola', 'Pizza', 'Strawberry', 'cheese', 'Chicken', 'Yogurt', 'Meat'])],
    ('ID', 'Items'))
df1.show(truncate=False)

+----+--------------------------------------------------------------+
|ID  |Items                                                         |
+----+--------------------------------------------------------------+
|7573|[Apple, zucchini, eggplant, beer, spaghetti]                  |
|2331|[beer, eggplant, Strawberry, cherryberry, Chicken, Fish, Meat]|
|5541|[Cola, Pizza, Strawberry, cheese, Chicken, Yogurt, Meat]      |
+----+--------------------------------------------------------------+

I want to create a new DataFrame that has one column per product and indicates whether the customer purchased it: 1 if yes, 0 if not.

It should look like this:

+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+
|  ID|Apple|Chicken|Cola|Fish|Meat|Pizza|Strawberry|Yogurt|beer|cheese|cherryberry|eggplant|spaghetti|zucchini|
+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+
|2331|    0|      1|   0|   1|   1|    0|         1|     0|   1|     0|          1|       1|        0|       0|
|5541|    0|      1|   1|   0|   1|    1|         1|     1|   0|     1|          0|       0|        0|       0|
|7573|    1|      0|   0|   0|   0|    0|         0|     0|   1|     0|          0|       1|        1|       1|
+----+-----+-------+----+----+----+-----+----------+------+----+------+-----------+--------+---------+--------+

I was able to do this with the following code:

import pyspark.sql.functions as F

df1.withColumn('exploded', F.explode('Items')).groupBy('ID').pivot('exploded').agg(F.lit(1)).na.fill(0).show()

But my problem is that when I use it on a larger dataset, I get the error "The pivot column exploded has more than 10000 distinct values."

How can I solve that?

  • Does this answer your question? [how to set spark.sql.pivotMaxValues in scala?](https://stackoverflow.com/questions/57260769/how-to-set-spark-sql-pivotmaxvalues-in-scala) (see the sketch after these comments) – Saaru Lindestøkke Feb 06 '23 at 23:34
  • @SaaruLindestøkke I have tried this and got another error: "Py4JJavaError: An error occurred while calling o810.agg." – Amit BenDavid Feb 06 '23 at 23:46
  • Well, what happens when you search for that error on the web? Any pointers to what the issue might be? – Saaru Lindestøkke Feb 06 '23 at 23:49
  • I would recommend rethinking your solution. If there are more than 10,000 distinct values, even if you are able to create a dataframe like this, I am not sure it will be performant or even usable because of the performance decrease. I would look into some of [the answers](https://stackoverflow.com/questions/48880934/performance-decrease-for-huge-amount-of-columns-pyspark) to a similar question – Derek O Feb 08 '23 at 01:10
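
A minimal sketch of what the first comment suggests, assuming spark and df1 are defined as in the question; the limit value 100000 and the names distinct_items and result are only illustrative, not a tested answer:

import pyspark.sql.functions as F

# Option 1: raise the limit that the pivot check enforces (the default is 10000).
spark.conf.set("spark.sql.pivotMaxValues", 100000)

# Option 2: compute the distinct items once and pass them to pivot() explicitly,
# so Spark does not have to discover (and count) the pivot values itself.
distinct_items = [row['item'] for row in
                  df1.select(F.explode('Items').alias('item')).distinct().collect()]

result = (df1.withColumn('exploded', F.explode('Items'))
             .groupBy('ID')
             .pivot('exploded', distinct_items)
             .agg(F.lit(1))
             .na.fill(0))
result.show()

Passing the values to pivot() explicitly skips the distinct-value scan that enforces spark.sql.pivotMaxValues, but as the last comment points out, a result with tens of thousands of columns may still be too wide to be practical.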

0 Answers