
I am working with Python and PySpark to extend SPSS Modeler.

I want to manipulate ~5000 columns and therefore use the following construct:

for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))

This is very slow. Is there a more efficient way to do this for all target columns?

targets is a list of the column names to process, and function(target) is a placeholder for operations on different columns, such as adding and dividing.

I would be happy if you could help me :)

pandayo


1 Answer


Try building all the new columns in a single select instead of calling withColumn in a loop:

inputData.select(
    '*', 
    *(function(target).alias(target+appendString) for target in targets)
)
Steven
  • Can you compare the execution plan of this method vs. the one proposed by OP? I suspect that, while this looks neater, it's actually doing the same thing under the hood. – pault Apr 23 '18 at 14:02
  • This method does not reassign the DataFrame on each iteration; you generate only one DataFrame. But yes, the execution plan is probably the same otherwise. – Steven Apr 23 '18 at 14:03
  • Thank you, this helps a lot. – pandayo Apr 23 '18 at 14:50