
Let's say I have a Spark RDD and need to process it:

rdd.mapPartitionsWithIndex { (index, iter) =>
  // iter can only be traversed once, so buffer it before using it twice
  val elems = iter.toSeq

  def someFunc(xs: Seq[Int]): Seq[Int] = ???
  def anotherFunc(idx: Int, xs: Seq[Int], ys: Seq[Int]): Seq[Int] = ???

  val x = someFunc(elems)
  val y = anotherFunc(index, elems, x)
  (x zip y).iterator
}

I define someFunc and anotherFunc inside mapPartitionsWithIndex because I don't want to define them in the driver and then serialize them to the workers. It works, but I can't test it, because nested functions aren't reachable from a test. How do I test this? I need to write test cases for those functions. Currently the closure serializes fine, but what if a function were not serializable and could not be sent from the driver to the workers?

Nan

1 Answer


The whole lambda will be serialized, and so will the inner functions ;)

You can:

  • create a helper object to hold those functions and write tests against that object (see the sketch after this list)
  • create a static nested class
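A minimal sketch of the first option (the names PartitionFuncs, process, the Int element type, and the function bodies are all assumptions for illustration):

import org.apache.spark.rdd.RDD

// Helper object: its methods compile to static calls, so using them
// inside the closure does not drag any enclosing instance along.
object PartitionFuncs {
  def someFunc(xs: Seq[Int]): Seq[Int] = xs.map(_ + 1)
  def anotherFunc(idx: Int, xs: Seq[Int], ys: Seq[Int]): Seq[Int] =
    xs.zip(ys).map { case (a, b) => a + b + idx }
}

def process(rdd: RDD[Int]): RDD[(Int, Int)] =
  rdd.mapPartitionsWithIndex { (index, iter) =>
    val elems = iter.toSeq
    val x = PartitionFuncs.someFunc(elems)
    val y = PartitionFuncs.anotherFunc(index, elems, x)
    (x zip y).iterator
  }

// Plain unit test, no SparkContext required:
assert(PartitionFuncs.someFunc(Seq(1, 2, 3)) == Seq(2, 3, 4))

Because the object's methods take plain collections, such tests run without any Spark machinery.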

Remember to:

  1. mark all non-serializable fields with @transient (a sketch follows this list)
  2. make your object/class extend Serializable
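In Scala that might look like this (the Logger field is an assumed example of a non-serializable member):

import org.apache.log4j.Logger

class MyTransformations extends Serializable {
  // A logger is a classic non-serializable field: @transient skips it
  // during serialization, and lazy val recreates it on each executor.
  @transient lazy val log: Logger = Logger.getLogger(getClass)

  def someFunc(xs: Seq[Int]): Seq[Int] = {
    log.debug(s"someFunc over ${xs.size} elements")
    xs.map(_ + 1)
  }
}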

You can also create an integration test that creates a SparkContext and runs the calculation in local mode, as in the example below.
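A bare-bones local-mode test might look like this (it assumes the process method from the earlier sketch; a real suite would use ScalaTest or a similar framework instead of a raw assert):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionFuncsIT")
val sc = new SparkContext(conf)
try {
  // Two partitions, so the index argument actually varies
  val result = process(sc.parallelize(1 to 6, numSlices = 2)).collect()
  assert(result.length == 6)
} finally {
  sc.stop() // always release the context, even if the assertion fails
}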

More information can be found, e.g., here.

T. Gawęda