The toy examples that show how to program in Spark make it look simple: you just import, create, use, and discard, all in one little function.
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.hive.HiveContext

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("example")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val hiveContext = new HiveContext(sc)
  import hiveContext.implicits._
  import hiveContext.sql

  // load data from HDFS
  val df1 = sc.textFile("hdfs://.../myfile.csv").map(...)
  val df1B = sc.broadcast(df1)

  // load data from Hive
  val df2 = sql("select * from mytable")

  // transform df2 using the broadcast df1B
  val cleanCol = udf(cleanMyCol(df1B)).apply(col("myCol"))
  val df2_new = df2.withColumn("myCol", cleanCol)

  ...

  sc.stop()
}
In the real world, I find myself writing quite a few functions to modularize the tasks. For example, I would have a few functions just to load the different data tables. And in these load functions I would call other functions to do necessary data cleaning/transformation as I load the data. Then I would pass the contexts like so:
def loadHdfsFileAndBroadcast(sc: SparkContext) = {
  // use sc here
  val df = sc.textFile("hdfs://.../myfile.csv").map(...)
  val dfB = sc.broadcast(df)
  dfB
}
def loadHiveTable(hiveContext: HiveContext, df1B: Broadcast[Map[String, String]]) = {
  import hiveContext.implicits._
  import org.apache.spark.sql.functions.{col, udf}
  val data = hiveContext.sql("select * from myHiveTable")
  // data cleaning
  val cleanCol = udf(cleanMyCol(df1B)).apply(col("myCol"))
  val df_cleaned = data.withColumn("myCol", cleanCol)
  df_cleaned
}
As you can see, the load function signatures get heavy quite easily.
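For instance, once a loader needs the HiveContext plus a couple of previously broadcast lookup tables, the signature ends up looking something like this (the table and column names here are made up purely for illustration):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.hive.HiveContext

// made-up example: the parameter list grows as soon as a loader needs
// the HiveContext plus a couple of broadcast lookup tables
def loadAndCleanOtherTable(hiveContext: HiveContext,
                           countryLookupB: Broadcast[Map[String, String]],
                           productLookupB: Broadcast[Map[String, String]]): DataFrame = {
  val data = hiveContext.sql("select * from anotherHiveTable")
  // replace codes with names via the broadcast maps
  val fixCountry = udf((c: String) => countryLookupB.value.getOrElse(c, c))
  val fixProduct = udf((p: String) => productLookupB.value.getOrElse(p, p))
  data.withColumn("country", fixCountry(col("country")))
      .withColumn("product", fixProduct(col("product")))
}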
I've tried to put these context imports at the class level, outside the main function, but that causes problems (see this issue), which leaves me no option but to pass them around.
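For reference, what I tried looked roughly like this (simplified, not the exact code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MyJob {
  val conf = new SparkConf().setAppName("example")
  val sc = new SparkContext(conf)
  val hiveContext = new HiveContext(sc)
  // the idea: import the implicits once at the top level so every helper
  // can use them without taking the context as a parameter
  import hiveContext.implicits._

  def loadHiveTable() = hiveContext.sql("select * from myHiveTable")

  def main(args: Array[String]): Unit = {
    val df2 = loadHiveTable()
    // ...
    sc.stop()
  }
}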
Is this the way to go or is there a better way to do this?