Replace pyspark column based on other columns

Question

In my "data" dataframe, I have two columns, 'time_stamp' and 'hour'. I want to insert the 'hour' value wherever the 'time_stamp' value is missing. I do not want to create a new column; I want to fill the missing values in 'time_stamp' itself.

I'm trying to translate this pandas code to PySpark:

data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)

Possible duplicate of [Spark Equivalent of IF Then ELSE](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) — pault, Mar 21 '19 at 15:11

score 1 · Accepted Answer · edited Mar 26 '19 at 09:32

Something like this should work:

from pyspark.sql import functions as f

df = df.withColumn(
    'time_stamp',
    # keep time_stamp where present, otherwise take the value from hour
    f.expr('case when time_stamp is null then hour else time_stamp end')
)

Alternatively, if you prefer the DataFrame API over SQL:

df = df.withColumn(
    'time_stamp',
    # otherwise() must be chained on the when() expression, not on the DataFrame
    f.when(f.col('time_stamp').isNull(), f.col('hour')).otherwise(f.col('time_stamp'))
)

edited Mar 26 '19 at 09:32 by Prathik Kini

answered Mar 21 '19 at 14:37 by ags29

score 1 · Answer 2 · answered Jun 21 '23 at 14:52

You can also use the function "coalesce", which returns, for each row, the first non-null value among its arguments, checked in the order they are passed. In your case, 'time_stamp' keeps its value where it is present and is filled from 'hour' where it is missing.

import pyspark.sql.functions as F
data = data.withColumn('time_stamp', F.coalesce('time_stamp', 'hour'))

Documentation for the function: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.coalesce.html
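
To make the row-wise behaviour concrete, here is a small plain-Python model of it. This is only a sketch of the semantics, not Spark code: the helper names and the sample rows are made up for illustration, and None stands in for a Spark null. It shows that, for this use case, "coalesce" and the when/otherwise form give the same result.

```python
from typing import Any, Optional


def coalesce(*values: Optional[Any]) -> Optional[Any]:
    """Model of coalesce: first argument that is not None, else None."""
    for value in values:
        if value is not None:
            return value
    return None


def when_otherwise(time_stamp: Optional[Any], hour: Optional[Any]) -> Optional[Any]:
    """Model of: when(time_stamp is null, hour).otherwise(time_stamp)."""
    return hour if time_stamp is None else time_stamp


# Hypothetical (time_stamp, hour) rows; None plays the role of a null.
rows = [("2019-03-21 14:00", "14"), (None, "15"), (None, None)]

# Fill time_stamp from hour where it is missing, one row at a time.
filled = [coalesce(ts, hr) for ts, hr in rows]
print(filled)
```

Note the last row: when both columns are missing, the result is still missing, so neither approach invents a value.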

answered Jun 21 '23 at 14:52 by simons____

While the other answer is a more direct translation, this answer is better practice for this specific use case. – Nolan Barth Jun 21 '23 at 19:15
