11

I use Scala implicit classes to extend objects I work with frequently. As an example, I have a method similar to this defined on Spark DataFrame:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class DataFrameExtensions(df: DataFrame) {
  // Group by every column to collapse duplicate rows.
  def deduplicate: DataFrame =
    df.groupBy(df.columns.map(col): _*).count
}

But implicit defs are not invoked if the class already defines the same method. What happens if I later upgrade to a new version of Spark that defines a DataFrame#deduplicate method? Client code will silently switch to the new implementation, which might cause subtle errors (or obvious ones, which are less problematic).
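
To make the shadowing concrete, here is a minimal sketch (with a hypothetical Foo class, not Spark) showing that a real member method always wins over an extension supplied by an implicit class:

class Foo { def greet: String = "member" }

object FooSyntax {
  implicit class FooOps(f: Foo) {
    def greet: String = "extension" // never selected once Foo#greet exists
  }
}

object ShadowingDemo extends App {
  import FooSyntax._
  println(new Foo().greet) // prints "member", not "extension"
}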

Using reflection, I can throw a runtime error if DataFrame already defines deduplicate before my implicit adds it. Theoretically, then, if my implicit method conflicts with an existing one, I can detect the conflict and rename my implicit version. However, once I upgrade Spark, run the app, and detect the issue, it's too late to use the IDE to rename the old method, since any references to df.deduplicate now resolve to the native Spark version. I would have to revert my Spark version, rename the method through the IDE, and then upgrade again. Not the end of the world, but not a great workflow.
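
For illustration, that runtime check could look something like this sketch (assertNotNative is a made-up helper):

object NativeMethodGuard {
  import org.apache.spark.sql.DataFrame

  // Throws at startup if a Spark upgrade has introduced the method natively.
  def assertNotNative(methodName: String): Unit =
    if (classOf[DataFrame].getMethods.exists(_.getName == methodName))
      throw new IllegalStateException(
        s"DataFrame now defines '$methodName' natively; rename the extension method")
}

// e.g. early in application startup:
// NativeMethodGuard.assertNotNative("deduplicate")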

Is there a better way to deal with this scenario? How can I use the "pimp my library" pattern safely?

Andrey Tyukin
Sasgorilla
  • I believe the safest (and simplest) thing to do would be to use a simple function instead of defining an implicit class: `def deduplicate(df: DataFrame): DataFrame = df.groupBy(df.columns.map(col): _*).count` – stefanobaghino May 14 '18 at 14:01
  • Not ideal, but you could use names that are unlikely to be introduced, e.g. `deduplicate_` or any other convention. – Łukasz May 14 '18 at 14:20
  • This seems like a variation on the [fragile base class problem](https://en.wikipedia.org/wiki/Fragile_base_class), but implicit-based enrichment doesn't impose any explicit relationship on the classes involved, so the compiler doesn't see a problem. – Joe Pallas May 14 '18 at 18:52
  • You should take a look at [typeclasses](https://blog.scalac.io/2017/04/19/typeclasses-in-scala.html) – James May 23 '18 at 05:59
  • Somewhat meta and off-topic, but does anyone else find it strange that we have three different unrelated tags for Scala implicits? – Andrey Tyukin May 25 '18 at 02:08
  • @AndreyTyukin With Scala there are always too many ways to express yourself. – som-snytt May 26 '18 at 18:05
  • @James Could you illustrate how typeclasses would help here? Seems like they would suffer the same problem. – Sasgorilla May 30 '18 at 00:05

3 Answers

6

You could add a test to the test suite of DataFrameExtensions that ensures certain code snippets do not compile. Maybe something like this:

"(???: DataFrame).deduplicate" shouldNot compile

If it compiles without your implicit conversion in scope, that means the method deduplicate has been introduced by the Spark library. In this case the test fails, and you know that you have to update your implicits.
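
Spelled out as a full (hypothetical) ScalaTest spec, it might look like this; note that DataFrameExtensions must not be imported in this file, so the snippet can only compile if Spark itself provides the method:

import org.apache.spark.sql.DataFrame
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class NoNativeDeduplicateSpec extends AnyFlatSpec with Matchers {
  // Deliberately do NOT import DataFrameExtensions here.
  "DataFrame" should "not define deduplicate natively" in {
    "(???: DataFrame).deduplicate" shouldNot compile
  }
}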

Andrey Tyukin
  • Clever. Maybe too much. ;-) – stefanobaghino May 24 '18 at 07:11
  • @stefanobaghino It's a single line <50 characters long. Don't see in which dimension it is "too much". You gonna test your embedded DSL's *somehow*, so... – Andrey Tyukin May 24 '18 at 10:24
  • My point is that for simple things like in the example, probably an embedded DSL built on an external library is an abuse of the power of the language. You've found a clever and clean way to easily spot issues, but for example questions like how to solve these errors remain open. Is renaming a viable option? Can we safely break the API? My point is not against your solution, but rather in favor of not dealing with this problem. I was proposing a perspective regarding the question itself, as seen through your brilliant way to approach the proposed problem. By the way, +1. – stefanobaghino May 24 '18 at 12:35
  • @stefanobaghino As Joe Pallas has remarked above, defining eDSL's using the pimp-my-library pattern introduces a version of a fragile base class problem. If you decide to pimp someone else's library, you are relying on a fragile base class that is under someone else's control, so you should at least monitor whether changes to this class break your code. The alternative would be to not use the pimp-my-library pattern, but that's not the question. I'm not sure what you mean by "embedded DSL ... external library ... abuse of power". It's ScalaTest, de-facto standard testing framework for Scala. – Andrey Tyukin May 24 '18 at 12:46
  • As I mentioned, I was proposing a different perspective on the question. – stefanobaghino May 24 '18 at 12:52
2

If the extension method is enabled by an import, use -Xlint to show when that import is no longer used:

class C { def x = 17 }  // C itself now defines x

trait T {
  import Extras._   // -Xlint reports this import as unused, since C#x wins
  def f = new C().x // resolves to the member, not the extension
}

object Extras {
  implicit class X(val c: C) {
    def x = 42
  }
}

Another variation, where the implicit evidence must actually be used, enforced under -Xlint -Xfatal-warnings:

class C[A] { def x = 17 }  // again, C itself defines x

trait T {
  import Mine.ev     // only "used" if the extension method below is selected
  val c = new C[Mine]
  def f = c.x        // the member wins, so the import is unused: fatal warning
}

trait Mine
object Mine {
  implicit class X[A](val c: C[A]) {
    // Requiring the evidence ties a use of the extension to a use of the
    // import; @deprecated keeps the otherwise-unreferenced parameter out
    // of the unused warnings.
    def x(implicit @deprecated("unused", "") ev: Mine) = 42
  }
  implicit val ev: Mine = null
}

object Test {
  def main(args: Array[String]): Unit = println {
    val t = new T {}
    t.f
  }
}
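
For reference, the flags can be pinned in the build itself so they are not silently dropped; with sbt that is a one-liner:

// build.sbt
scalacOptions ++= Seq("-Xlint", "-Xfatal-warnings")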
som-snytt
  • This is cool. I'm a little leery of relying on particular compiler flags to alert me to the problem, since those might easily be changed by unwitting third parties. But seems like a good layer of defense-in-depth. – Sasgorilla May 29 '18 at 17:38
0

The way to do it safely is to ask explicitly for an extended data frame. To minimize the impact, you can still use an implicit to get a nice syntax for the conversion (like toJava/toScala, etc.):

implicit class DataFrameExtSyntax(df: DataFrame) {
  // Explicit entry point into the extension methods.
  def toExtended: DataFrameExtensions = DataFrameExtensions(df)
}
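
The answer doesn't show DataFrameExtensions itself; for the chained calls below to work, it would have to be a wrapper that re-wraps its results, something like this sketch (deduplicate's body is carried over from the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

case class DataFrameExtensions(df: DataFrame) {
  // Re-wrap the result so further extension calls can chain;
  // the last comment below explores the limits of this approach.
  def deduplicate: DataFrameExtensions =
    DataFrameExtensions(df.groupBy(df.columns.map(col): _*).count)
}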

And then your invocation will look like this:

myDf.toExtended
  .deduplicate
  .someOtherExtensionMethod
  .andMore

That way you're future-proofing your extension methods without runtime checks, linting, or unit-test tricks. (You can even use `myDf.ext` if `myDf.toExtended` is too long :) )

GClaramunt
  • But doesn't `toExtended` suffer the same problem? With `toJava` this problem does not exist (because both the extension method and the conflicting one have to come from the same library), while `toScala` could suffer from it, but that's very unlikely because the name is "namespaced" — so this pattern still seems problematic. – Blaisorblade May 23 '18 at 11:06
  • Potentially, yes; you could pick an even weirder name... but yeah, for extra safety, I would add the "shouldNot compile" test. – GClaramunt May 23 '18 at 19:56
  • I would take exception to linting as a trick, but you goaded me to an alternative, where an implicit, if used, indicates that your API was definitively used. Your solution is the classic advice, of course, but @Blaisorblade is sharp, pun intended. – som-snytt May 27 '18 at 06:07
  • I'm not wild about this solution, since it slightly reduces readability and introduces the mental overhead of remembering which of my dataframe methods are "extended" and which are native, but I think this is the most future-proof plan. Other answers offer great ways of *detecting* a conflict, but no way to fix it once detected besides reverting the Spark version and changing the name. Too bad there's no Scala `method_missing` ... if there were I could just get in the habit of always using the extended class, and delegate native methods back to the original dataframe. – Sasgorilla May 29 '18 at 17:36
  • The Dynamic trait (https://www.scala-lang.org/api/2.12.4/scala/Dynamic.html) behaves like a "method missing"; see the sketch after these comments. – GClaramunt May 30 '18 at 18:17
  • @Sasgorilla By the way, a reminder: awarding the bounty is unrelated to accepting an answer. If you found this answer useful, then please take a second to [mark the question as solved](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – Andrey Tyukin May 30 '18 at 20:31
  • Another hiccup here. I don’t think the answer is quite right as written. I can call `df.ext.xm.nm`, but not `df.ext.xm.xm` (`xm` = extension method, `nm` = native method), because the return value of `df.ext.xm` is a `df`, not a `df.ext`. If I wrap all return values in `DataFrameExtensions` and provide an implicit conversion `DataFrameExtensions` -> `DataFrame`, I get further, but not all of the way there (e.g., I can do `df.ext.xm.xm.nm`, but not `df.ext.xm.nm.xm`). I don’t see a way to freely and transparently chain both native and extension method calls with this pattern. – Sasgorilla Jul 23 '18 at 16:54
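
A sketch of that Dynamic idea (hypothetical code, not part of the answer): calls the wrapper doesn't define are forwarded to the underlying DataFrame by reflection, at the cost of static type safety on the forwarded calls:

import scala.language.dynamics
import org.apache.spark.sql.DataFrame

class DynamicDf(val df: DataFrame) extends Dynamic {
  // Extension methods are defined normally...
  def deduplicate: DynamicDf = new DynamicDf(df.dropDuplicates())

  // ...and anything else (zero-argument methods, at least) is forwarded
  // to the wrapped DataFrame via reflection, returning Any.
  def selectDynamic(name: String): Any =
    df.getClass.getMethod(name).invoke(df)
}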