I'm looking for the most efficient way to transform a large list of variables (100+) that may or may not exist in my original data frame. Column values are 1 byte. If a value is not NULL, recode it to 1; if NULL, recode it to 0. Then rename the column so it starts with 'U_'.
My code works, but it's terribly inefficient. I'm new to coding in PySpark and could use some pointers.
```python
update_vars_list = ['Col_1', 'Col_2', 'Col_3', ..., 'Col_n']

for var in update_vars_list:
    if var in original_df.columns:
        original_df = original_df.withColumn('U_' + var, f.when(f.col(var).isNotNull(), 1).otherwise(0)).drop(var)
```
Example: