
I'm looking for the most efficient way to transform a large list of variables (100+) that may or may not exist in my original data frame. Column values are 1 byte. If the value is not NULL, recode it to 1. If NULL, recode it to 0. Then rename the column to start with 'U_'.

My code works, but it's terribly inefficient. I'm new to coding in PySpark and could use some pointers.

from pyspark.sql import functions as f

update_vars_list = ['Col_1', 'Col_2', 'Col_3', ..., 'Col_n']
for var in update_vars_list:
    if var in original_df.columns:
        original_df = original_df.withColumn('U_' + var, f.when(f.col(var).isNotNull(), 1).otherwise(0)).drop(var)


  • Doing multiple calls to `withColumn` is terribly slow. Instead, use `select` to do all your transformations and simply use `alias` to rename your columns. Check out [this answer](https://stackoverflow.com/questions/59789689/spark-dag-differs-with-withcolumn-vs-select) – Siddhant Tandon Oct 28 '20 at 11:53

1 Answer


As mentioned in the comments, build the transformations as a list of expressions and apply them with a single select:

expr = [f.when(f.col(var).isNotNull(), 1).otherwise(0).alias('U_' + var)
        for var in update_vars_list if var in original_df.columns]
df = original_df.select(*expr)
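
For context, here is a minimal, self-contained sketch of that approach; the sample data, column names, and the `spark` session setup are illustrative assumptions, not part of the original question:

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; Col_3 is in the list but absent from the DataFrame.
original_df = spark.createDataFrame(
    [('a', None), (None, 'b')],
    'Col_1 string, Col_2 string',
)
update_vars_list = ['Col_1', 'Col_2', 'Col_3']

# One projection builds every 0/1 indicator column and renames it in the same step.
expr = [f.when(f.col(var).isNotNull(), 1).otherwise(0).alias('U_' + var)
        for var in update_vars_list if var in original_df.columns]
original_df.select(*expr).show()
# +-------+-------+
# |U_Col_1|U_Col_2|
# +-------+-------+
# |      1|      0|
# |      0|      1|
# +-------+-------+

This produces a single projection in the query plan instead of one per `withColumn` call, which is where the speedup comes from.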