
I'm trying to get the SQLContext instance from one module in another module. The first module instantiates it as an implicit sqlContext, and I had (erroneously) thought that I could then use an implicit parameter in the second module, but the compiler informs me that:

could not find implicit value for parameter sqlCtxt: org.apache.spark.sql.SQLContext

Here's the skeletal setup I have (I have elided imports and details):

-----
// Application.scala
-----

package apps

object Application extends App {
  val env = new SparkEnvironment("My app", ...)

  try {
    // Call methods from various packages that internally use code from DFExtensions.scala
  }
}

-----
// SparkEnvironment.scala
-----

package common

class SparkEnvironment(val app: String, ...) {
  @transient lazy val conf: SparkConf = new SparkConf().setAppName(app)
  @transient implicit lazy val sc: SparkContext = new SparkContext(conf)
  @transient implicit lazy val sqlContext: SQLContext = new SQLContext(sc)
  ...
}

-----
// DFExtensions.scala
-----
package util

object DFExtensions {

  private def myFun(...)(implicit sqlCtxt: SQLContext) = { ... }

  implicit final class DFExt(val df: DataFrame) extends AnyVal {
    // Extension methods for DataFrame where myFun is supposed to be used -- fails to compile!
  }
}

Since it's a multi-project sbt setup, I don't want to pass the env instance around to all related objects, because the stuff in util is really a shared library. Each sub-project (i.e. app) creates its own instance in its main method.

Because myFun is only called from the implicit class DFExt, I thought about creating an implicit just before each call, à la implicit val sqlCtxt = df.sqlContext. That compiles, but it's kind of ugly, and it would make the implicit in SparkEnvironment redundant.
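For concreteness, the workaround looks roughly like this inside object DFExtensions (a sketch only; someExtensionMethod is an invented name, and myFun's real parameters are elided in the post, so I pass df purely for illustration):

implicit final class DFExt(val df: DataFrame) extends AnyVal {
  // Hypothetical extension method illustrating the workaround:
  // the implicit val has to be re-declared before every call to myFun.
  def someExtensionMethod: DataFrame = {
    implicit val sqlCtxt: SQLContext = df.sqlContext
    myFun(df) // sqlCtxt is picked up as the implicit parameter
  }
}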

According to this discussion the implicit sqlContext instance is not in scope, hence compilation fails. I'm not sure a package object would work because the implicit value and parameter are in different packages.

Is what I'm trying to achieve even possible? Is there a better alternative?

The idea is to have several sub-projects that use the same libraries and core functions share a single project. They are typically updated together, so it's nice to have them in one place. Most of the library functions work directly on data frames and other Spark structures, but occasionally I need to do something that requires an instance of SparkContext or SQLContext, for instance to write a query with sqlContext.sql because some syntax is not yet natively supported (e.g. flattening with outer lateral views), as sketched below.
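For instance, such a query might look like this (table and column names are invented for illustration, and df and sqlContext are assumed to be in scope; LATERAL VIEW requires a Hive-compatible SQL dialect):

// Hypothetical example: flattening an array column with an outer lateral view,
// which the DataFrame API did not support natively in Spark 1.4.
df.registerTempTable("events")
val flattened = sqlContext.sql(
  """SELECT id, tag
    |FROM events
    |LATERAL VIEW OUTER explode(tags) t AS tag""".stripMargin)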

Each sub-project has its own main method that creates an implicit instance. Obviously the libraries do not 'know' about this as they are in different packages and I don't pass around the instances. I had thought that somehow implicits are looked for at runtime, so that when an application runs there is an instance of SQLContext defined as an implicit. It's possible that a) it's not in scope because it's in a different package or b) what I'm trying to do is just a bad idea.
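(For reference: implicit parameters are resolved statically at compile time from what is in scope at the call site, not looked up at runtime. A minimal sketch of the rule, with invented names:

object Provider {
  implicit val ctx: Int = 42
}

object Consumer {
  def f(implicit i: Int): Int = i

  def demo: Int = {
    import Provider._ // explicitly bring the implicit into scope
    f                 // compiles: resolved to Provider.ctx at compile time
  }
  // Without the import, `f` alone would not compile:
  // "could not find implicit value for parameter i: Int"
}

So the instance does exist at runtime, but the compiler never sees it at the call site.)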

Currently there is only one main method because I first have to split the application in multiple components, which I have not done yet.

Just in case it helps:

  • Spark 1.4.1
  • Scala 2.10
  • sbt 0.13.8
  • You could try `import env._`. – Reactormonk Aug 26 '16 at 06:13
  • Good luck working out the implicit lookup rules in Scala - the rules are insane. –  Aug 26 '16 at 06:24
  • How can I import an instance that may have a different name in different main methods? – Ian Aug 26 '16 at 06:35
  • How would that even be possible? How would that even be represented in a symbol table? –  Aug 26 '16 at 06:37
  • Can you update your post on what you are trying to achieve? I don't mean what you have above, but rather in high level terms, what is the point of your design. –  Aug 26 '16 at 06:38
  • Do you mean putting the SparkEnvironment stuff in a package object? – Ian Aug 26 '16 at 06:50
  • I've replicated your problem locally so I will try to see if my proposed solution will work or not. –  Aug 26 '16 at 06:51
  • OK, I just read your update. Package objects are not going to help you here and coercing implicits to behave as you need is not the way to go here. The easy option is just to import as mentioned earlier. –  Aug 26 '16 at 07:09
  • Hmm, but since they are in different sub-projects and the apps depend on the libs, this would require a circular dependency: I have to import an instance from an application... – Ian Aug 26 '16 at 07:14
  • Your design is flawed (sorry to be blunt). There must be a better way to achieve your goals. It's hard to say what that might be without delving deep into your project. –  Aug 26 '16 at 07:16
  • Think about how you would design your solution using Java. Sometimes having all the extra language features in Scala can obscure good judgement. –  Aug 26 '16 at 07:21
  • I'm not a Java dev, so that's not a big help ;-) However, no reason to apologise for bluntness. It's a project I inherited and I'm trying to make it a tad more structured and split out the libraries from the actual application-specific code. Thanks for your feedback! – Ian Aug 26 '16 at 07:34

1 Answer


Because myFun is only called from the implicit class DFExt I thought about creating an implicit just before each call à la implicit val sqlCtxt = df.sqlContext and that compiles but it's kind of ugly and I would not need the implicit in SparkEnvironment any longer.

Just put the implicit and myFun inside DFExt:

implicit final class DFExt(val df: DataFrame) extends AnyVal {
  private implicit def sqlCtxt: SQLContext = df.sqlContext

  // no need to take an implicit parameter, as sqlCtxt is already in scope
  private def myFun(...) = ...

  // The extension methods can now use sqlCtxt and/or myFun freely
}

You could also make sqlCtxt a val, but then:

  • DFExt can't extend AnyVal anymore;
  • the val has to be initialized even if the extension method you call doesn't need it;
  • any calls to sqlCtxt are likely to be inlined, so you would just be accessing a val from df instead of this anyway; and if they aren't inlined, you are calling it far too rarely for it to matter.
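For illustration, a call site might then look like this (flatten is an invented stand-in for the elided extension methods, and SparkEnvironment's elided constructor parameters are omitted):

package apps

import common.SparkEnvironment
import util.DFExtensions._ // brings the implicit class DFExt into scope

object Demo extends App {
  val env = new SparkEnvironment("Demo app")
  val df = env.sqlContext.read.json("events.json")
  // No implicit SQLContext is needed at the call site:
  // DFExt derives it from the DataFrame itself.
  df.flatten.show()
}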

– Alexey Romanov