I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:
- There is a list of dictionaries, each with three key-value pairs: source:value, target:value, and column:value.
- A CASE WHEN ... THEN expression is generated from these three key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
          {'target': 'F', 'source': '0', 'column': 'gender'},
          {'target': 'M', 'source': '1', 'column': 'gender'},
          {'target': 'F', 'source': 'F', 'column': 'gender'},
          {'target': 'F', 'source': 'Fe', 'column': 'gender'}]
- The following code builds that expression (simplified here; the first iteration has to start the chain, and `op.when(...)` extends it):
op = None  # assuming pyspark.sql.functions is imported as F
for value in values:
    source_value = value.get("source")
    cond = df[value.get("column")] == source_value
    op = F.when(cond, value.get("target")) if op is None else op.when(cond, value.get("target"))
The resulting Column looks like this:
Column<'CASE WHEN (gender = ) THEN Unknown
WHEN (gender = 0) THEN F
WHEN (gender = 1) THEN M
WHEN (gender = F) THEN F
WHEN (gender = Fe) THEN F END'>
- This CASE WHEN ... THEN expression is then used to select data from a DataFrame.
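For reference, the same mapping can be rendered as a plain SQL CASE string without any Spark dependency; this is only a sketch of what the generated expression looks like (the quoting of string literals is my assumption; Spark's Column repr above omits the quotes):

```python
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
          {'target': 'F', 'source': '0', 'column': 'gender'},
          {'target': 'M', 'source': '1', 'column': 'gender'},
          {'target': 'F', 'source': 'F', 'column': 'gender'},
          {'target': 'F', 'source': 'Fe', 'column': 'gender'}]

def build_case_expr(mappings):
    # One WHEN ... THEN branch per mapping entry, joined into a single CASE.
    branches = " ".join(
        f"WHEN ({m['column']} = '{m['source']}') THEN '{m['target']}'"
        for m in mappings
    )
    return f"CASE {branches} END"

expr = build_case_expr(values)
print(expr)
```

A string like this could be passed to `F.expr(...)` or used in a `selectExpr`, which is equivalent to chaining `when()` calls programmatically.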
Question: Is the usage of CASE WHEN ... THEN valid here (i.e., is it well optimized)? Some of the generated CASE expressions are very lengthy (1000+ branches). Is there a better way to write this code (regex, perhaps)?
I looked at the questions below, but they were not relevant to my case: CASE WHEN ... THEN and SPARK SQL - case when then.
Thanks.