I'm new to Spark 2.0 and we're using Datasets in our code base. I've noticed that I end up importing spark.implicits._ in almost every function. For example:
File A
import org.apache.spark.sql.{Dataset, SparkSession}

class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    val ds: Dataset[Foo] = ??? // create dataset ds here
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds, spark)
  }

  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}
File B
import org.apache.spark.sql.{Dataset, SparkSession}

class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
    import spark.implicits._
    ds.map(e => "SomeString")
  }
}
What I wanted to ask is whether there's a cleaner way to write
ds.map(e => "SomeString")
without importing the implicits in every function where I call map. If I don't import them, I get the following error:
Error:(53, 13) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
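For context, one workaround I've seen floated (I'm not sure it's considered good practice, which is partly why I'm asking) is to define a standalone object extending SQLImplicits, so the encoders can be imported once at the top of a file instead of inside every method. A minimal sketch, assuming Spark 2.x's SQLImplicits and that an active SparkSession exists before any encoder is resolved:

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

// Hypothetical helper, not part of Spark itself: it exposes the same
// encoders as spark.implicits by implementing SQLImplicits' one
// abstract member.
object Implicits extends SQLImplicits {
  // getOrCreate() hands back the already-active session, so this assumes
  // the SparkSession is built before any of these encoders are used.
  protected override def _sqlContext: SQLContext =
    SparkSession.builder.getOrCreate().sqlContext
}

With that in place, File B could do import Implicits._ once at the top and drop the per-method imports. Is something like this the way to go, or is there a better pattern?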