If I try to use `withColumn` with the output of a previous `withColumn`, it takes hours to complete:
df_spark.filter(some_criteria) \
    .withColumn('Field2', MyCustomUDF('Field1')) \
    .withColumn('Field3', MyCustomUDF2('Field2')) \
    .write.parquet('Parq.parquet')
However, if I split it into two separate steps with an intermediate file, it only takes minutes:
# Step 1
df_spark.filter(some_criteria) \
    .withColumn('Field2', MyCustomUDF('Field1')) \
    .write.parquet('TmpFile.parquet')

# Step 2
df_spark2 = spark.read.parquet('TmpFile.parquet')
df_spark2.withColumn('Field3', MyCustomUDF2('Field2')) \
    .write.parquet('Parq.parquet')
- Is this expected behavior?
- Can I do this more efficiently without writing out a temp file? For instance, can a single UDF return two fields?