
I need to define custom methods on DataFrame. What is the best way to do it? The solution should be scalable, as I intend to define a significant number of custom methods.

My current approach is to create a class (say MyClass) with DataFrame as parameter, define my custom method (say customMethod) in that and define an implicit method which converts DataFrame to MyClass.

implicit def dataFrametoMyClass(df: DataFrame): MyClass = new MyClass(df)

Thus I can call:

dataFrame.customMethod()

Is this the correct way to do it? I am open to suggestions.

Martin Senne
Pravin Gadakh

3 Answers


Your way is the way to go (see [1]). I solved it a little differently, but the approach is similar:

Possibility 1

Implicits

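import scala.language.implicitConversions
import org.apache.spark.sql.DataFrame

object ExtraDataFrameOperations {
  object implicits {
    // explicit return type: recommended for implicit definitions
    implicit def dFWithExtraOperations(df: DataFrame): DFWithExtraOperations = DFWithExtraOperations(df)
  }
}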

case class DFWithExtraOperations(df: DataFrame) {
  def customMethod(param: String) : DataFrame = {
    // do something fancy with the df
    // or delegate to some implementation
    //
    // here, just as an illustrating example: do a select
    df.select( df(param) )
  }
}

Usage

To use the new customMethod method on a DataFrame:

import ExtraDataFrameOperations.implicits._
val df = ...
val otherDF = df.customMethod("hello")

Possibility 2

Instead of using an implicit method (see above), you can also use an implicit class:

Implicit class

object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df: DataFrame) {
    def customMethod(param: String): DataFrame = {
      // do something fancy with the df
      // or delegate to some implementation
      //
      // here, just as an illustrating example: do a select
      df.select( df(param) )
    }
  }
}

Usage

import ExtraDataFrameOperations._
val df = ...
val otherDF = df.customMethod("hello")

Remark

If you want to avoid the additional import, turn the object ExtraDataFrameOperations into a package object and store it in a file called package.scala within your package.
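As a rough sketch of that remark, the package-object variant could look like the following. The package name com.example.analytics and the Report object are made up for illustration; in a real project the package object would normally live in its own package.scala file, and both parts are shown in one file here only to keep the example short.

```scala
package com.example

import org.apache.spark.sql.DataFrame

// Normally stored as com/example/analytics/package.scala (hypothetical package name).
package object analytics {
  implicit class DFWithExtraOperations(df: DataFrame) {
    // Same illustrative method as above: select a single column by name.
    def customMethod(param: String): DataFrame = df.select(df(param))
  }
}

// Any code inside the same package sees customMethod without an extra import.
package analytics {
  object Report {
    // Hypothetical helper, only here to show the call site.
    def helloColumn(df: DataFrame): DataFrame = df.customMethod("hello")
  }
}
```

Code outside com.example.analytics still has to import com.example.analytics._ to get the extension method.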

Official documentation / references

[1] The original blog "Pimp my library" by M. Odersky is available at http://www.artima.com/weblogs/viewpost.jsp?thread=179766

Martin Senne
  • Thank you for the very thorough answer! Which possibility is easier to adapt so that "import spark.implicits._" can be used? The former annoyingly needs a SparkSession object, and this is a headache! – vak Jun 18 '18 at 16:14

There is a slightly simpler approach: just declare MyClass as implicit

implicit class MyClass(df: DataFrame) { def myMethod = ... }

This automatically creates the implicit conversion method (also called MyClass). You can also make it a value class by adding extends AnyVal (the constructor parameter must then be declared as a val), which avoids some overhead by not actually creating a MyClass instance at runtime, but this is very unlikely to matter in practice.
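For illustration, a value-class version might look roughly like this. The enclosing object DataFrameSyntax and the body of myMethod are assumptions made for the example (the original leaves the body elided); the parts that matter are the val parameter and extends AnyVal.

```scala
import org.apache.spark.sql.DataFrame

// Value classes must be top-level or members of a statically accessible
// object; DataFrameSyntax is a hypothetical name for that holder.
object DataFrameSyntax {
  // A value class wraps exactly one `val` parameter.
  implicit class MyClass(val df: DataFrame) extends AnyVal {
    // Placeholder body for illustration: keep only the named column.
    def myMethod(column: String): DataFrame = df.select(column)
  }
}
```

With import DataFrameSyntax._ in scope, df.myMethod("id") is called the same way as with the plain implicit class.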

Finally, putting MyClass into a package object will allow you to use the new methods anywhere in this package without requiring import of MyClass, which may be a benefit or a drawback for you.

Alexey Romanov
  • Thank you for the very concise solution! Similarly to my question to Martin: is it possible to adapt this solution so that "import spark.implicits._" can be used nicely? The former annoyingly needs a SparkSession object, and this is a headache! – vak Jun 18 '18 at 16:16
  • How is it a headache? If the problem is that you have to pass the SparkSession alongside the DataFrame, you don't, it's available: you can write `import df.sparkSession.implicits._`. – Alexey Romanov Sep 21 '18 at 06:32
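A minimal sketch of what that comment suggests, assuming an implicit class like the ones in the answers. The names SessionAwareOps, DFWithSessionImplicits and distinctStrings are invented for this example, and the column is assumed to hold strings.

```scala
import org.apache.spark.sql.DataFrame

object SessionAwareOps {
  implicit class DFWithSessionImplicits(df: DataFrame) {
    // Hypothetical method: distinct values of a string column.
    def distinctStrings(column: String): Seq[String] = {
      // The session comes from the DataFrame itself, so the caller
      // does not need to pass a SparkSession around.
      import df.sparkSession.implicits._
      // .as[String] needs the String encoder provided by the import above.
      df.select(column).as[String].distinct().collect().toSeq
    }
  }
}
```

The import works because df is a stable identifier inside the class, so the session's implicits can be brought into scope locally, right where an encoder is needed.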

I think you should add an implicit conversion between DataFrame and your custom wrapper, but use an implicit class. This should be the easiest to use, and it keeps your custom methods in one common place.

   implicit class WrappedDataFrame(val df: DataFrame) {
     // Scala parameter syntax is name: Type (not Java's Type name)
     def customMethod(arg1: String, arg2: Int): Unit = {
       // ...[do your stuff here]
     }
     // ...[other methods you consider useful, getters, setters, whatever]...
   }

If the implicit wrapper is in scope where the DataFrame is used, you can call the wrapper's methods directly on a normal DataFrame, i.e.:

df.customMethod("test", 100)

TheMP