1

I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:

  1. There is a list of dictionaries with three key-value pairs. One is source:value, second is target:value and third is column:value.
  2. CASE WHEN THEN statement is generated based on above three key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'}, 
{'target': 'F', 'source': '0', 'column': 'gender'}, 
{'target': 'M', 'source': '1', 'column': 'gender'}, 
{'target': 'F', 'source': 'F', 'column': 'gender'}, 
{'target': 'F', 'source': 'Fe', 'column': 'gender'}]
  1. The following code generates the CASE WHEN THEN statement that follows.
for value in values:
    source_value = value.get("source")
    op = op.when(df[column] == source, value.get("target"))
Column<'CASE WHEN (gender = ) THEN Unknown 
WHEN (gender = 0) THEN F 
WHEN (gender = 1) THEN M 
WHEN (gender = F) THEN F 
WHEN (gender = Fe) THEN F END'>
  1. This CASE WHEN THEN is then used to select data from a dataframe.

Question: Is the usage of CASE WHEN THEN valid here (is it optimized)? Some of the CASE WHEN statements are very very lengthy (around 1000+). Is there a better way to redo the code (regex perhaps)?

I looked at the below questions, but were not relevant for my case. CASE WHEN ... THEN SPARK SQL - case when then

Thanks.

Sarvavyapi
  • 810
  • 3
  • 23
  • 35
  • I feel like this is better as a UDF with basic if-elif statements – OneCricketeer Sep 07 '21 at 11:49
  • @OneCricketeer, thanks. Is there a way to compare dictionaries instead of using CASE WHEN? Like, use 2 dictionaries, one with source and other with target and get a third one with the replaced values? – Sarvavyapi Sep 07 '21 at 11:55

1 Answers1

1

Two alternatives:

  • Use UDF, in which you can access a dictionary of values
  • Build a table, and perform broadcast join

The way to know which is better is by examining the execution plan, job duration and total shuffle.

Yosi Dahari
  • 6,794
  • 5
  • 24
  • 44