3

What's the difference between UDF and custom expression as far as Spark DataFrame/SQL context is concerned? In particular, are both of them opaque to Catalyst? What are the reasons to use one vs the other?

(Custom expressions were mentioned, for example, here - although in that case they weren't needed.)

Community
  • 1
  • 1
max
  • 49,282
  • 56
  • 208
  • 355
  • I found the answer, but it's not mine. https://forums.databricks.com/answers/2706/view.html. It appears that expressions are kinda like a version of UDF that can participate in Catalyst and Tungsten optimizations. (Normal UDF, even Scala UDFs, can't.) It seems that it needs to be written in Scala, but once it's written, python API can be added. – max Jul 02 '16 at 18:38

1 Answers1

5

UDF:

  • operates on Scala types (you can access UDT)
  • is marked as non-deterministic
  • cannot be moved in execution plan
  • cannot be used for codegen

Expression:

  • operates on catalyst types
  • can be marked as deterministic / non-deterministic
  • can be used for codegen but not all implement
  • can be moved in execution plan

Both - are opaque unless backed by expression specific catalyst rules

b9f516c9
  • 91
  • 2
  • Thanks! Where can I read more details about it? While I have some vague intuition about these, I don't actually know these concepts. For example, I thought "opaque" just means Catalyst can optimize it, but obviously there's more to it. – max Jul 04 '16 at 04:25