I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:
- There is a list of dictionaries, each with three key-value pairs: source:value, target:value, and column:value.
- A CASE WHEN ... THEN expression is generated from these three key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
          {'target': 'F', 'source': '0', 'column': 'gender'},
          {'target': 'M', 'source': '1', 'column': 'gender'},
          {'target': 'F', 'source': 'F', 'column': 'gender'},
          {'target': 'F', 'source': 'Fe', 'column': 'gender'}]
- The following code builds that expression (simplified here; the first iteration has to start the chain, and `op.when(...)` extends it):
op = None  # assuming pyspark.sql.functions is imported as F
for value in values:
    source_value = value.get("source")
    cond = df[value.get("column")] == source_value
    op = F.when(cond, value.get("target")) if op is None else op.when(cond, value.get("target"))
The resulting Column looks like this:
Column<'CASE WHEN (gender = ) THEN Unknown
WHEN (gender = 0) THEN F
WHEN (gender = 1) THEN M
WHEN (gender = F) THEN F
WHEN (gender = Fe) THEN F END'>
- This CASE WHEN ... THEN expression is then used to select data from a DataFrame.
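For reference, the same mapping can be rendered as a plain SQL CASE string without any Spark dependency; this is only a sketch of what the generated expression looks like (the quoting of string literals is my assumption; Spark's Column repr above omits the quotes):

```python
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
          {'target': 'F', 'source': '0', 'column': 'gender'},
          {'target': 'M', 'source': '1', 'column': 'gender'},
          {'target': 'F', 'source': 'F', 'column': 'gender'},
          {'target': 'F', 'source': 'Fe', 'column': 'gender'}]

def build_case_expr(mappings):
    # One WHEN ... THEN branch per mapping entry, joined into a single CASE.
    branches = " ".join(
        f"WHEN ({m['column']} = '{m['source']}') THEN '{m['target']}'"
        for m in mappings
    )
    return f"CASE {branches} END"

expr = build_case_expr(values)
print(expr)
```

A string like this could be passed to `F.expr(...)` or used in a `selectExpr`, which is equivalent to chaining `when()` calls programmatically.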
Question: Is the usage of CASE WHEN ... THEN valid here (i.e., is it well optimized)? Some of the generated CASE expressions are very lengthy (1000+ branches). Is there a better way to write this code (regex, perhaps)?
I looked at the questions below, but they were not relevant to my case: CASE WHEN ... THEN and SPARK SQL - case when then.
Thanks.