I am trying to teach myself Scala whilst also trying to write code that is idiomatic of a functional language, i.e. better, more elegant, functional code.

I have the following code that works OK:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame){
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
val dataFrames = Seq(df.featuresGroup1(groupBy, asAt),df.featuresGroup2(groupBy, asAt))

The last line bothers me though. The two functions (featuresGroup1, featuresGroup2) both have the same signature:

scala> :type df.featuresGroup1(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame

scala> :type df.featuresGroup2(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame

and take the same vals as parameters, so I assume I can write that line in a more functional way (perhaps using .map somehow) so that I only have to write the parameter list once and pass it to both functions. I can't figure out the syntax, though. I thought maybe I could construct a list of those functions, but that doesn't work:

scala> Seq(featuresGroup1, featuresGroup2)
<console>:23: error: not found: value featuresGroup1
       Seq(featuresGroup1, featuresGroup2)
           ^
<console>:23: error: not found: value featuresGroup2
       Seq(featuresGroup1, featuresGroup2)
                           ^

Can anyone help?

jamiet

5 Answers

I thought maybe I could construct a list of those functions but that doesn't work

You need to explicitly perform eta expansion to turn methods into functions (they are not the same in Scala), by using an underscore operator:

val funcs = Seq(featuresGroup1 _, featuresGroup2 _)

or by using placeholders:

val funcs = Seq(featuresGroup1(_, _), featuresGroup2(_, _))

And you are absolutely right about using the map operator:

val dataFrames = funcs.map(f => f(groupBy, asAt))
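
Note that because featuresGroup1 and featuresGroup2 are defined on the implicit class rather than being in scope as bare names, the eta expansion needs the df receiver (see the comments below); the variant that works against the question's setup is:

val funcs = Seq(df.featuresGroup1(_, _), df.featuresGroup2(_, _))
val dataFrames = funcs.map(f => f(groupBy, asAt))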

I strongly recommend against using implicit values of plain types like String or Seq: if they are used in multiple places, they lead to subtle bugs that are not immediately obvious from the code, and the code becomes prone to breaking when it is moved around.

If you want to use implicits, wrap them in custom types:

case class DfGrouping(groupBy: Seq[String]) extends AnyVal

implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))
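
For illustration, a hypothetical helper (the name withFeatures is made up here) that picks the grouping up from implicit scope, so call sites only pass the arguments that vary:

// Hypothetical sketch: the DfGrouping is resolved from implicit scope,
// while asAt is still passed explicitly.
def withFeatures(df: DataFrame, asAt: LocalDate)(implicit g: DfGrouping): DataFrame =
  df.featuresGroup1(g.groupBy, asAt)

withFeatures(df, asAt) // grouping comes from the implicit val above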
Oleg Pyzhcov
  • Thx Oleg. I tried `val funcs = Seq(featuresGroup1 _, featuresGroup2 _)` but that failed: `:24: error: not found: value featuresGroup1`. Could this be because those functions are defined in an implicit class? – jamiet May 23 '18 at 06:54
  • `Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))` does the trick – jamiet May 23 '18 at 07:55
  • I appreciate your recommendation not to use implicits. I admit when I saw @vindev's suggestion it left me slightly uneasy as I wondered "what if I've got multiple implicit values" and did think that subtle bugs could creep in. Thank you for that. – jamiet May 23 '18 at 08:27

I thought maybe I could construct a list of those functions but that doesn't work:

Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?

Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))

df.featuresGroup1 _ should work as well.

df.featuresGroup1 by itself would work if you had an expected type, e.g.

val dataframes: Seq[(Seq[String], LocalDate) => DataFrame] = 
  Seq(df.featuresGroup1, df.featuresGroup2)

but in this specific case providing the expected type is more verbose than using lambdas.

Alexey Romanov
  • *Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?* Probably because I'm still learning – jamiet May 23 '18 at 07:36
  • Your suggestion seems to work and is closest to the pure functional approach I was envisaging. Let me just check it all through and then I shall accept your answer. thank you :) – jamiet May 23 '18 at 07:38
  • P.S. There's a small typo in your answer. `asAt`, not `asAdt` – jamiet May 23 '18 at 07:41
  • *in this specific case providing the expected type is more verbose than using lambdas* Agreed. I prefer the type inferencing provided by the lambda approach – jamiet May 23 '18 at 07:42
  • `Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))` does work. `Seq(df.featuresGroup1 _, df.featuresGroup2 _).map(_(groupBy, asAt))` does not, it fails with error **:26: error: missing argument list for method featuresGroup1 in class DataFrameExtensions | Unapplied methods are only converted to functions when a function type is expected. | You can make this conversion explicit by writing `featuresGroup1 _` or `featuresGroup1(_,_)` instead of `featuresGroup1`.** – jamiet May 23 '18 at 07:45
  • That's weird, it even suggests ` _`. Since you are using Spark, it's probably an older Scala version, but I don't remember any recent changes linked to this. – Alexey Romanov May 23 '18 at 08:03
  • I'm glad you think its weird, I concur :) . Scala version is 2.11.12 – jamiet May 23 '18 at 08:08
  • Still got that typo :) – jamiet May 23 '18 at 08:28
  • Fixed the typo. – Alexey Romanov May 23 '18 at 08:50

Why not just create a function in DataFrameExtensions to do this?

def getDataframeGroups(groupBy: Seq[String], asAt: LocalDate) = Seq(featuresGroup1(groupBy, asAt), featuresGroup2(groupBy, asAt))
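
Assuming getDataframeGroups is added inside the implicit class (so it can call the other two methods directly), the call site becomes a one-liner:

val dataFrames = df.getDataframeGroups(groupBy, asAt)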
RoberMP
  • Thx @RoberMP, yes that works. I'm not going to accept your answer yet though (I promise I will later) as I'm interested to read any other suggestions that might be posited; I'm particularly interested to know if this can be done with a `.map()` rather than using a *helper* function (which is how I think of your solution). – jamiet May 23 '18 at 07:29

I think you could create a list of functions as below:

val funcs: List[DataFrame => (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] = List(_.featuresGroup1, _.featuresGroup2)
funcs.map(x => x(df)(groupBy, asAt))

It seems you have a list of functions which convert a DataFrame to another DataFrame. If that is the pattern, you could go a little further with Endo in Scalaz.
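
A minimal sketch of that idea, assuming Scalaz 7 on the classpath (the step bodies just reuse the question's methods):

import scalaz._, Scalaz._

// Each step is a DataFrame => DataFrame, wrapped in Endo so the steps compose monoidally.
val steps: List[Endo[DataFrame]] = List(
  Endo(_.featuresGroup1(groupBy, asAt)),
  Endo(_.featuresGroup2(groupBy, asAt))
)
val pipeline: Endo[DataFrame] = steps.suml // combines all steps into a single DataFrame => DataFrame
val result: DataFrame = pipeline.run(df)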

Binzi Cao
  • thx Binzi. I tried: `val funcs:List[DataFrame=>(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] = List(_.featuresGroup1, _.featuresGroup1)` which failed with `:24: error: type mismatch;|found : org.apache.spark.sql.DataFrame|(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]|required: (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame|(which expands to) (Seq[String], java.time.LocalDate) => org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]` – jamiet May 23 '18 at 06:59
  • It works well from here on my side: `funcs: List[org.apache.spark.sql.DataFrame => ((Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame)] = List(<function1>, <function1>)` and `res1: List[org.apache.spark.sql.DataFrame] = List([number: int, word: string], [number: int, word: string])` – Binzi Cao May 23 '18 at 07:09
  • tried that, got `:1: error: ';' expected but '=' found.` :( . Sorry, I'm sure I'm just being a dumb newbie but I just can't figure out where the problem is. – jamiet May 23 '18 at 07:18
  • Sorry my reply was not code, it was the return result of the lines of code in the answer. I used your sample code in the questions. it works well with my answer. maybe there is something in your REPL – Binzi Cao May 23 '18 at 07:22
  • ah, OK. Showing my naivety :) I'll keep investigating. thx. At the time of writing Oleg's suggestion seems closest to the pure functional solution I envisaged, but it doesn't work for my scenario and I can't figure out why :( – jamiet May 23 '18 at 07:32
  • If you copy a screenshot of the code and error, I may help to find it out. This is actually one of the most important pure functional concepts: `curried` and `uncurried`. The data type in the list is a curried function, and Scala supports curried functions. I would suggest you investigate the error further; currying is a very important concept in FP. In FP, everything can be made a function. – Binzi Cao May 23 '18 at 08:50
  • I'm going to post a repro as a github gist. Watch this space. – jamiet May 23 '18 at 09:31
  • I got your code to work. not sure what I was doing wrong yesterday. Posted here: https://gist.github.com/jamiekt/cea2dab3ea8de91489b31045b302e011 as a gist. Your code is on these two lines: https://gist.github.com/jamiekt/cea2dab3ea8de91489b31045b302e011#file-script-scala-L28-L29 – jamiet May 24 '18 at 12:31
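
For reference, a minimal illustration of the curried shape discussed in these comments, using the question's definitions (eta expansion kicks in because the expected type is a function type):

// A curried function: apply it to a DataFrame first, then to the remaining arguments.
val curried: DataFrame => (Seq[String], java.time.LocalDate) => DataFrame = _.featuresGroup1
val applied: DataFrame = curried(df)(groupBy, asAt)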

I like this answer best, courtesy of Alexey Romanov.

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame){
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
jamiet