
I have an RDD that I've converted into a Spark SQL DataFrame. I want to do a number of transformations of columns with UDFs, which ends up looking something like this:

df = df.withColumn("col1", udf1(df.col1))\
       .withColumn("col2", udf2(df.col2))\
       ...
       ...
       .withColumn("newcol", udf(df.oldcol1, df.oldcol2))\
       .drop(df.oldcol1).drop(df.oldcol2)\
       ...

etc.

Is there a more concise way to express this (both the repeated withColumn and drop calls)?

user4601931

1 Answer


You can collect all of these operations into a single list of column expressions:

from pyspark.sql.functions import col

exprs = [udf1(col("col1")).alias("col1"),
         udf2(col("col2")).alias("col2"),
         ...
         udfn(col("coln")).alias("coln")]

And then unpack them inside a select:

df = df.select(*exprs)

With this approach, all of the UDFs are applied to your df in a single select, and the resulting columns are renamed with alias. Note that my answer is almost exactly like this one; however, that question was quite different from this one, which is why I decided to answer it rather than flag it as a duplicate.
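To show this end to end, here is a minimal runnable sketch; the SparkSession setup, the sample data, and the udf1/udf2/combine definitions are hypothetical stand-ins for the UDFs in the question. It also covers the second part of the question: columns that are not listed in the select (oldcol1 and oldcol2 here) are dropped automatically, so the repeated drop calls disappear as well.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the column names in the question.
df = spark.createDataFrame([(1, "a", 10, 20)],
                           ["col1", "col2", "oldcol1", "oldcol2"])

# Hypothetical UDFs standing in for udf1, udf2, and udf above.
udf1 = udf(lambda x: x + 1, IntegerType())
udf2 = udf(lambda s: s.upper(), StringType())
combine = udf(lambda a, b: a + b, IntegerType())

# One select replaces the whole withColumn/drop chain;
# oldcol1 and oldcol2 are dropped simply by not selecting them.
exprs = [udf1(col("col1")).alias("col1"),
         udf2(col("col2")).alias("col2"),
         combine(col("oldcol1"), col("oldcol2")).alias("newcol")]

df = df.select(*exprs)
df.show()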

Alberto Bonsanto