Using df.withColumn() on multiple columns

Question

I am working with Python and PySpark to extend the SPSS Modeler.

I want to manipulate ~5000 columns and therefore use the following construct:

for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))

This is very slow. Is there a more efficient way to do this for all target columns?

targets is a list of the column names to use, and function(target) is a placeholder for the operations I apply to the columns, such as adding and dividing.

I would be happy if you could help me :)

pandayo

Steven · Accepted Answer · 2018-04-23T14:05:41.133

4

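Try this:

# Add all derived columns in one select instead of one withColumn call per column.
# function(target) must return a Column expression.
inputData = inputData.select(
    '*',
    *(function(target).alias(target + appendString) for target in targets)
)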
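The same idea can be wrapped in a small helper. This is only a sketch: the names add_derived_columns, suffix and make_column are made up for illustration and stand in for appendString and function from the question. It assumes make_column returns a PySpark Column for each column name. The PySpark documentation for withColumn notes that each call introduces a projection internally, so calling it in a loop can produce very large plans, and it recommends a single select with multiple columns instead.

```python
def add_derived_columns(df, targets, suffix, make_column):
    """Return df with one extra column per name in targets, built in one select.

    df          -- a PySpark DataFrame (anything with a compatible select() works)
    targets     -- list of existing column names
    suffix      -- string appended to each name to form the new column name
    make_column -- hypothetical callable: column name -> Column expression
    """
    # Build every new column expression first; nothing is evaluated yet.
    new_columns = [make_column(name).alias(name + suffix) for name in targets]
    # A single select adds one projection to the plan, however many columns there are.
    return df.select("*", *new_columns)
```

With PySpark available this could be called as, for example, inputData = add_derived_columns(inputData, targets, appendString, function).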

edited Apr 23 '18 at 14:05

answered Apr 23 '18 at 14:01

Steven


Can you compare the execution plan of this method vs. the one proposed by OP? I suspect that, while this looks neater, it's actually doing the same thing under the hood. – pault Apr 23 '18 at 14:02
This method does not reassign the DataFrame on each iteration; you generate only one DataFrame. But yes, the execution plan is probably the same otherwise. – Steven Apr 23 '18 at 14:03
Thank you, this helps a lot. – pandayo Apr 23 '18 at 14:50
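To check the point raised in the comments, the plans of both variants can be printed side by side. A minimal sketch, assuming both DataFrames have already been built (one with the withColumn loop, one with the single select); print_plans and the labels are made-up names. DataFrame.explain(True) prints the logical and physical plans to stdout.

```python
def print_plans(labelled_frames):
    """Print the extended plan of each DataFrame under its label.

    labelled_frames -- dict mapping a descriptive label to a PySpark DataFrame
    """
    for label, frame in labelled_frames.items():
        print(f"== {label} ==")
        # explain(True) prints the parsed, analyzed and optimized logical
        # plans as well as the physical plan; it returns None.
        frame.explain(True)
```

For example: print_plans({"withColumn loop": looped, "single select": selected}), where looped and selected are the two results to compare.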
