
I have a PySpark dataframe event1. It has many columns, one of which is eventAction, holding categorical values like 'conversion', 'check-out', etc.

I want to transform this column so that 'conversion' becomes 1 and every other category becomes 0 in the eventAction column.

This is what I tried:

import pyspark.sql.functions as F

event1.eventAction = event1.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0))
event1.show()

But I don't see any change in eventAction column when .show() is executed.

  • Does this answer your question? [How do I add a new column to a Spark DataFrame (using PySpark)?](https://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark) – blackbishop Mar 23 '21 at 08:46
  • And this one: [Updating a dataframe column in spark](https://stackoverflow.com/questions/29109916/updating-a-dataframe-column-in-spark) – blackbishop Mar 23 '21 at 08:57
  • @blackbishop those questions may be related, but frankly, the answers are all over-complicated / even irrelevant to this relatively simple question. – mck Mar 23 '21 at 09:00
  • @blackbishop This question is different in that it used a somewhat intuitive approach to update columns through the column attribute of the dataframe, but somehow failed and thus confused the OP. This usage, though incorrect, has not been asked before, so it's worth being left open. – mck Mar 23 '21 at 09:19

1 Answer


Spark DataFrames are immutable, so you cannot change a column in place with dot-notation attribute assignment: event1.eventAction = ... merely sets a Python attribute on the object and leaves the underlying DataFrame untouched (and the select in your attempt returns a separate one-column DataFrame anyway, which is why .show() looks unchanged). Instead, build a new DataFrame that replaces the existing column using withColumn, and assign the result back:

import pyspark.sql.functions as F

event1 = event1.withColumn(
    'eventAction', 
    F.when(F.col('eventAction') == 'conversion', 1).otherwise(0)
)
mck