
I'm looking for the most efficient way to transform a large list of variables (100+) that may or may not exist in my original data frame. Column values are 1 byte. If the value is not NULL, recode it to 1. If NULL, recode it to 0. Then rename the column to start with 'U_'.

My code works, but it's terribly inefficient. I'm new to coding in PySpark and could use some pointers.

from pyspark.sql import functions as f

update_vars_list = ['Col_1', 'Col_2', 'Col_3', ..., 'Col_n']
for var in update_vars_list:
    if var in original_df.columns:
        original_df = original_df.withColumn('U_' + var, f.when(f.col(var).isNotNull(), 1).otherwise(0)).drop(var)


  • Doing multiple calls to `withColumn` is terribly slow. Instead, use `select` to do all your transformations and simply use `alias` to rename your columns. Check out [this answer](https://stackoverflow.com/questions/59789689/spark-dag-differs-with-withcolumn-vs-select) – Siddhant Tandon Oct 28 '20 at 11:53

1 Answer


As mentioned in the comments, build the transformations as a list of expressions and apply them with a single select:

expr = [f.when(f.col(var).isNotNull(), 1).otherwise(0).alias('U_' + var)
        for var in update_vars_list if var in original_df.columns]
df = original_df.select(*expr)
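
For context, here is a minimal, self-contained sketch of that approach; the sample data, column names, and the `spark` session setup are illustrative assumptions, not part of the original question:

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; Col_3 is in the list but absent from the DataFrame.
original_df = spark.createDataFrame(
    [('a', None), (None, 'b')],
    'Col_1 string, Col_2 string',
)
update_vars_list = ['Col_1', 'Col_2', 'Col_3']

# One projection builds every 0/1 indicator column and renames it in the same step.
expr = [f.when(f.col(var).isNotNull(), 1).otherwise(0).alias('U_' + var)
        for var in update_vars_list if var in original_df.columns]
original_df.select(*expr).show()
# +-------+-------+
# |U_Col_1|U_Col_2|
# +-------+-------+
# |      1|      0|
# |      0|      1|
# +-------+-------+

This produces a single projection in the query plan instead of one per `withColumn` call, which is where the speedup comes from.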