
When calling a function from an external class many times, which will give me better performance: a lazy val function or a def method? So far, what I understand is:

def method:

  1. Defined on and tied to a class; it needs to be declared inside an object in order to be called Java-static style.
  2. Call-by-name: evaluated only when accessed, and re-evaluated on every access.

lazy val lambda expression:

  1. Tied to an instance of one of the FunctionN traits (Function1 through Function22).
  2. Call-by-value: evaluated the first time it is accessed, and evaluated only that one time.
  3. Is actually a def apply method defined on an anonymous class.

So it may seem that using a lazy val will reduce the need to evaluate the function every time; should it be preferred?
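
To illustrate what I mean, here is a minimal sketch (plain Scala, invented names; the println calls are only there to trace evaluation):

object EvalDemo extends App {
  def viaDef: Int = { println("def body evaluated"); 42 }
  lazy val viaLazyVal: Int = { println("lazy val body evaluated"); 42 }

  viaDef      // prints "def body evaluated"
  viaDef      // prints it again: a def is re-evaluated on every access
  viaLazyVal  // prints "lazy val body evaluated"
  viaLazyVal  // prints nothing: the result was cached on first access
}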

I ran into this while writing UDFs for Spark, and I'm trying to understand which approach is better.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object sql {
  // Normalizes "empty-ish" strings to None so Spark stores them as NULL.
  def emptyStringToNull(str: String): Option[String] = {
    Option(str).getOrElse("").trim match {
      case "" => None
      case "[]" => None
      case "null" => None
      case _ => Some(str.trim)
    }
  }

  // As a def, this wraps the method in a fresh UDF object on every access.
  def udfEmptyStringToNull: UserDefinedFunction = udf(emptyStringToNull _)

  // Variant 1: a plain method.
  def repairColumn_method(dataFrame: DataFrame, colName: String): DataFrame = {
    dataFrame.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }

  // Variant 2: a function literal held in a lazy val.
  lazy val repairColumn_fun: (DataFrame, String) => DataFrame = { (df, colName) =>
    df.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }
}
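
For context, both variants are called the same way (spark below stands for an existing SparkSession, and the input path is made up):

val df = spark.read.json("/data/events.json")  // hypothetical input
val viaMethod = sql.repairColumn_method(df, "some_col")
val viaFun    = sql.repairColumn_fun(df, "some_col")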
Maor Aharon
    Show us some code. If you need the code to be evaluated each time use a `def`, if you want a single evaluation use `lazy val` but only if the computation is heavy and you have some case where the `val` would not be evaluated. In case of a `udf` in Spark, I'm pretty sure a `lazy val` doesn't bring much benefit. – Gaël J Aug 14 '21 at 18:50
  • I added some code. IMHO this is a general question, not just related to Spark: what is the best practice every time I use the same function heavily? – Maor Aharon Aug 15 '21 at 10:26
  • _every time I use the same function heavily_ — is it a pure function (no side effects)? Is it always called with the same parameters? – Gaël J Aug 15 '21 at 10:33
  • Note that Spark DataFrames are lazy by definition. No computation will happen until you call a "terminal" operation like collect/write. In the same spirit, a UDF is a description of a function to apply. – Gaël J Aug 15 '21 at 10:36
  • Yes, it's a pure function called with the same parameters. Can you remove the downvote so that other people will view the question? – Maor Aharon Aug 15 '21 at 11:50
  • Then yes, it makes no sense to call it multiple times. – Gaël J Aug 15 '21 at 11:52
  • I'm not sure; that's why I'm asking. A `val` is tied to an object, therefore it could be locked, while a `def` is not. – Maor Aharon Aug 15 '21 at 12:05
  • Does this answer your question? [What's the (hidden) cost of Scala's lazy val?](https://stackoverflow.com/questions/3041253/whats-the-hidden-cost-of-scalas-lazy-val) – Gaël J Aug 15 '21 at 12:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236038/discussion-between-maor-aharon-and-gael-j). – Maor Aharon Aug 16 '21 at 07:09

1 Answer


There's no need for you to use a lazy val in this specific case. When you assign a function to a lazy val, its results are not memoized, as you seem to think they are. Since the function itself is a plain function literal and not the result of an expensive computation (regardless of what goes on inside it), making it lazy is not useful. All it does is add overhead when accessing and calling it. A simple val would be better, but making it a proper method would be best.
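
To make that concrete, here is a quick sketch (invented names; the println is only for tracing) showing that only the function value is created lazily, while its body still runs on every call:

lazy val trimmed: String => String = { s =>
  println("body executed")  // runs on every call, not just the first
  s.trim
}

trimmed(" a ")  // prints "body executed"
trimmed(" b ")  // prints "body executed" again: the result is not cached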

If you want memoization, see [Is there a generic way to memoize in Scala?](https://stackoverflow.com/questions/16257378/is-there-a-generic-way-to-memoize-in-scala) instead.
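
For a rough idea of what memoization involves, here is a simple, non-thread-safe sketch using a mutable map (see the linked question for sturdier options):

import scala.collection.mutable

def memoize[A, B](f: A => B): A => B = {
  val cache = mutable.Map.empty[A, B]
  a => cache.getOrElseUpdate(a, f(a))  // compute each distinct input only once
}

val slowSquare: Int => Int = { n => Thread.sleep(1000); n * n }
val fastSquare = memoize(slowSquare)
fastSquare(3)  // slow the first time: computed and cached
fastSquare(3)  // instant: served from the cache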

Ignoring your specific example, if the def in question didn't take any arguments and both it and the lazy val were simple values that were expensive to compute, I would go with the lazy val if you're going to call it many times to avoid computing it over and over again.

If they were values that were very cheap to compute and not accessed many times, or expensive to compute but only accessed once, I would go with a def instead. There wouldn't be much difference if you used a lazy val, but a def avoids generating the extra fields a lazy val needs.

If they're somewhat cheap to compute but accessed many times, it may be better to use a lazy val simply because the result will be cached (see the sketch below). However, you might want to look at your overall design before looking at such micro-optimizations.
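
For example (expensiveResult is an invented stand-in for some heavy computation):

// Computed once on first access, then cached: pays off when accessed many times.
lazy val expensiveResult: BigInt = (1 to 10000).map(BigInt(_)).product

// Recomputed on every call: fine when called rarely (or not at all), and it
// avoids the extra field and initialization flag a lazy val generates.
def expensiveResultDef: BigInt = (1 to 10000).map(BigInt(_)).product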

user
  • Why do you recommend a `lazy val` for a simple computation that's heavily used rather than a `def`, and, in the opposite case, why would I prefer a `def` for a complicated computation that's heavily used? – Maor Aharon Aug 16 '21 at 06:39
  • At the beginning you said "When you assign a function to a `lazy val`, its results are not memoized, as you seem to think they are", but later you also said "it may be better to use a `lazy val` simply because the result will be cached". You've got me totally confused... – Maor Aharon Aug 16 '21 at 12:30
  • @MaorAharon A `lazy val` will basically cache a single value. [Memoization](https://en.wikipedia.org/wiki/Memoization) is different - it remembers the outputs for particular inputs to a *function*. `lazy val`s are not made for functions. Does that help? – user Aug 16 '21 at 13:11
  • Ok, thanks! So if I have a simple Spark function which only converts a String column into an Int one, will declaring it as a lazy val function literal eliminate the need to evaluate it every time I execute it and, as a result, save computing time? – Maor Aharon Aug 16 '21 at 18:23
  • @MaorAharon No, that would require memoization. There'd still be some overhead from hashing, but it'd probably be worth it. See [this question](https://stackoverflow.com/questions/16257378/is-there-a-generic-way-to-memoize-in-scala) for how to do that. Good luck! – user Aug 16 '21 at 18:34