
I need to read several CSV files and convert several columns from String to Double.

The code looks like this:

  def f(s: String): Double = s.toDouble

  def readonefile(path: String) = {
    val data = for {
      line <- sc.textFile(path)
      arr = line.split(",").map(_.trim)  // all column values of one line
      id = arr(33)                       // column 33 is the line's id
    } yield {
      val countings = ((9 to 14) map arr).toVector map f  // columns 9 to 14 as Double
      id -> countings
    }
    data
  }
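
To be concrete, the per-line logic boils down to something like this (a minimal non-Spark sketch with placeholder data; in the real files column 33 holds the id and columns 9 to 14 the numeric values):

  // Placeholder row with 34 columns, each holding its own index as a string.
  val arr: Array[String] = Array.tabulate(34)(i => i.toString)
  val id = arr(33)                                           // "33"
  val countings = ((9 to 14) map arr).toVector.map(_.toDouble)
  // countings == Vector(9.0, 10.0, 11.0, 12.0, 13.0, 14.0)
  println(id -> countings)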

If I write the toDouble explicitly (e.g. function f in the code), Spark throws java.io.IOException or java.lang.ExceptionInInitializerError.

However, if I change countings to

  val countings = ((9 to 14) map arr).toVector map (_.toDouble)

then everything works fine.

Is function f serializable?

EDIT:

Some people say it is the same as [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou). But why doesn't it throw a Task not serializable exception?
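
One way to check directly whether f can be shipped is to try to Java-serialize the function value yourself. This is only an approximation of what Spark does (the ClosureCleaner and the configured closure serializer are also involved), a minimal sketch assuming f is in scope as defined above:

  import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

  // Returns true if `obj` survives a plain Java serialization round trip,
  // which is roughly what Spark needs before shipping a closure to executors.
  def javaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // If f is a method of a non-serializable enclosing class, the function value
  // `f _` captures that instance and this prints false.
  println(javaSerializable(f _))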

Scala version 2.10

Spark version 1.3.1

Environment: yarn-client

worldterminator
    possible duplicate of [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou) – maasg Jun 01 '15 at 08:47
  • @maasg, I thought they are similar. But how to explain the error message it throws? I expected to see NotSerializableException. – worldterminator Jun 02 '15 at 02:18
  • I'm a bit confused with the primary for loop you are doing, can you explain what you are trying to accomplish? – Holden Jun 03 '15 at 00:10
  • @Holden I am reading lines from a CSV file. arr is an array storing all the column values of a line in the CSV file. I want columns 9 to 14 converted to Double. Column 33 is the id of the line. Finally I get an RDD of id -> (columns 9 to 14 as Doubles). – worldterminator Jun 03 '15 at 03:34

1 Answer


We can move the function f into a separate object. I also rewrote the transformations to avoid the for comprehension, since I'm not sure it's doing what you want. Note that you might want to use spark-csv instead of just splitting on commas, but hopefully this illustrates the idea:

  object Panda {
    def f(s: String): Double = s.toDouble
  }

  def readonefile(path: String) = {
    val input = sc.textFile(path)
    // split each line into trimmed column values
    val arrs = input.map(line => line.split(",").map(_.trim))
    // pair the id column with columns 9 to 14 converted via Panda.f
    arrs.map(arr => (arr(33).toDouble,
                     ((9 to 14) map arr).map(Panda.f).toVector))
  }
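
For illustration, using it might look like this (assuming sc is an existing SparkContext; the path is just a placeholder):

  // Hypothetical usage; the path below is illustrative only.
  val rdd = readonefile("hdfs:///data/input.csv")
  rdd.take(5).foreach { case (id, values) => println(s"$id -> $values") }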
Holden