
I need to read several CSV files and convert several columns from String to Double.

The code looks like this:

  def f(s: String): Double = s.toDouble

  def readonefile(path: String) = {
    val data = for {
      line <- sc.textFile(path)
      arr = line.split(",").map(_.trim)  // all column values of one line
      id = arr(33)                       // column 33 is the line's id
    } yield {
      val countings = ((9 to 14) map arr).toVector map f  // columns 9 to 14 as Double
      id -> countings
    }
    data
  }
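
To be concrete, the per-line logic boils down to something like this (a minimal non-Spark sketch with placeholder data; in the real files column 33 holds the id and columns 9 to 14 the numeric values):

  // Placeholder row with 34 columns, each holding its own index as a string.
  val arr: Array[String] = Array.tabulate(34)(i => i.toString)
  val id = arr(33)                                           // "33"
  val countings = ((9 to 14) map arr).toVector.map(_.toDouble)
  // countings == Vector(9.0, 10.0, 11.0, 12.0, 13.0, 14.0)
  println(id -> countings)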

If I write the toDouble explicitly (e.g. function f in the code), Spark throws java.io.IOException or java.lang.ExceptionInInitializerError.

However, if I change countings to

  val countings = ((9 to 14) map arr).toVector map (_.toDouble)

then everything works fine.

Is function f serializable?

EDIT:

Some people say it is the same as [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou). But why doesn't it throw a Task not serializable exception?
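
One way to check directly whether f can be shipped is to try to Java-serialize the function value yourself. This is only an approximation of what Spark does (the ClosureCleaner and the configured closure serializer are also involved), a minimal sketch assuming f is in scope as defined above:

  import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

  // Returns true if `obj` survives a plain Java serialization round trip,
  // which is roughly what Spark needs before shipping a closure to executors.
  def javaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // If f is a method of a non-serializable enclosing class, the function value
  // `f _` captures that instance and this prints false.
  println(javaSerializable(f _))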

Scala version 2.10

Spark version 1.3.1

Environment: yarn-client

worldterminator
    possible duplicate of [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou) – maasg Jun 01 '15 at 08:47
  • @maasg, I thought they are similar. But how to explain the error message it throws? I expected to see NotSerializableException. – worldterminator Jun 02 '15 at 02:18
  • I'm a bit confused with the primary for loop you are doing, can you explain what you are trying to accomplish? – Holden Jun 03 '15 at 00:10
  • @Holden I am reading lines from a CSV file. arr is an array storing all the column values of a line in the CSV file. I want columns 9 to 14 converted to Double. Column 33 is the id of the line. Finally I get an RDD of id -> (columns 9 to 14 as Doubles). – worldterminator Jun 03 '15 at 03:34

1 Answer


We can move the function f into a separate object. I also rewrote the transformations to avoid the for comprehension, since I'm not sure it's doing what you want. Note that you might want to use spark-csv instead of just splitting on commas, but hopefully this illustrates the idea:

  object Panda {
    def f(s: String): Double = s.toDouble
  }

  def readonefile(path: String) = {
    val input = sc.textFile(path)
    // split each line into trimmed column values
    val arrs = input.map(line => line.split(",").map(_.trim))
    // pair the id column with columns 9 to 14 converted via Panda.f
    arrs.map(arr => (arr(33).toDouble,
                     ((9 to 14) map arr).map(Panda.f).toVector))
  }
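
For illustration, using it might look like this (assuming sc is an existing SparkContext; the path is just a placeholder):

  // Hypothetical usage; the path below is illustrative only.
  val rdd = readonefile("hdfs:///data/input.csv")
  rdd.take(5).foreach { case (id, values) => println(s"$id -> $values") }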
Holden