0

My code is like this:

dataFrame.createOrReplaceTempView("tableName1")
//udf1 and udf2 will return a struct
var dataFrame1 = sql("select udf1(a) as col1,udf2(a) as col2 from tableName1");
dataFrame1.createOrReplaceTempView("tableName2")
var dataFrame2 = sql("select col1.x,col1.y,col1.z,col2.x,col2.y,col2.z from tableName2");
dataFrame2.write.parquet("path")

I see the physical plan in Spark UI is like this:

select udf1(a).x,udf1(a).y,udf1(a).z,udf2.x,udf2.y,udf2.z from tableName2

It seems like that the udf1 and udf2 will be invoked 3 times。 What I really want is that udf1 and udf2 only compute once。 Hope some can help me.

stefanobaghino
  • 11,253
  • 4
  • 35
  • 63
cxco
  • 143
  • 2
  • 13
  • 2
    I don't think that the udf is called 3 times. I would use an LongAccumulator to count then UDF calls directly and then compare with the number of rows in your dataframe – Raphael Roth Jan 18 '18 at 12:46
  • @RaphaelRoth I add a log in the udf1. when I run this sql, every row have 3 logs of udf1. – cxco Jan 21 '18 at 07:29
  • it's possible that the udf is really computed more than once, see e.g. https://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns, You could try to influence the optimizer, e.g. by caching dataFrame1: `var dataFrame1 = sql("select udf1(a) as col1,udf2(a) as col2 from tableName1").cache();` – Raphael Roth Jan 21 '18 at 07:49

0 Answers0