
I am trying to convert a Spark RDD to a Spark SQL dataframe with toDF(). I have used this function successfully many times, but in this case I'm getting a compiler error:

error: value toDF is not a member of org.apache.spark.rdd.RDD[com.example.protobuf.SensorData]

Here is my code below:

// SensorData is an auto-generated class
import com.example.protobuf.SensorData
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def loadSensorDataToRdd(): RDD[SensorData] = ???

object MyApplication {
  def main(argv: Array[String]): Unit = {

    val conf = new SparkConf()
    conf.setAppName("My application")
    conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val sensorDataRdd = loadSensorDataToRdd()
    val sensorDataDf = sensorDataRdd.toDF() // <-- CAUSES COMPILER ERROR
  }
}

I am guessing that the problem is with the SensorData class, which is a Java class that was auto-generated from a Protocol Buffer. What can I do in order to convert the RDD to a dataframe?

Jacek Laskowski
stackoverflowuser2010

1 Answer


The reason for the compilation error is that there's no Encoder in scope to convert an RDD of com.example.protobuf.SensorData to a Dataset of com.example.protobuf.SensorData.

Encoders (ExpressionEncoders to be exact) are used to convert InternalRow objects into JVM objects according to the schema (usually a case class or a Java bean).

You may be able to create an Encoder for the custom Java class using the org.apache.spark.sql.Encoders object's bean method, which the API docs describe as:

Creates an encoder for Java Bean of type T.

Something like the following:

import org.apache.spark.sql.Encoders
implicit val SensorDataEncoder = Encoders.bean(classOf[com.example.protobuf.SensorData])
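Wired into the question's main method, the whole thing could look like the following sketch (it reuses sc and loadSensorDataToRdd from the question; whether Encoders.bean accepts SensorData depends on the generated class being a valid Java bean):

```scala
import org.apache.spark.sql.{Encoders, SQLContext}

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The implicit bean encoder lets the RDD-to-Dataset conversion resolve
implicit val sensorDataEncoder =
  Encoders.bean(classOf[com.example.protobuf.SensorData])

val sensorDataDf = loadSensorDataToRdd().toDF()  // should now compile
```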

If SensorData uses unsupported types, you'll have to map the RDD[SensorData] to an RDD of some simpler type(s), e.g. a tuple of the fields, and only then call toDF.
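A minimal sketch of that fallback, assuming SensorData exposes getters such as getId and getTemperature (hypothetical names, not taken from the actual generated class):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def toSimpleDf(rdd: RDD[com.example.protobuf.SensorData],
               sqlContext: SQLContext) = {
  import sqlContext.implicits._
  rdd
    .map(s => (s.getId, s.getTemperature)) // project to supported primitive types
    .toDF("id", "temperature")             // column names are illustrative
}
```

Tuples of primitives always have an implicit Encoder in scope via sqlContext.implicits._, so this sidesteps the bean-encoder requirement entirely.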

Jacek Laskowski
  • +1 Converting a custom object RDD to Dataset (aka DataFrame) is not the right answer, but going to Dataset via an encoder IS the right answer. Datasets with custom objects are ideal because you'll get compilation errors and catalyst optimizer performance gains. – Garren S Apr 30 '17 at 01:34
  • Sure! Wouldn't have said it better @Garren, but that was outside this question and hence didn't even bother mentioning it. Good point, though! Thanks. I might add your comment if I see the comment's upvotes (yours or mine) ;-) – Jacek Laskowski Apr 30 '17 at 06:59
  • The question was about converting a custom object RDD to a Dataframe which would be a silly conversion, so I felt clarifying your intent to use a Dataset instead of the specific DataFrame request was tangentially within the scope of the question – Garren S Apr 30 '17 at 15:18
  • @JacekLaskowski: Please do add any additional information that could help me (and others). – stackoverflowuser2010 May 01 '17 at 18:50
  • @Garren: I am relatively new to Spark and have been using either RDDs or DataFrames. I am not familiar with DataSets at all, but I was under the impression that a DataFrame is simply an alias for a DataSet (see this page: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-dataframe.html). Is there a more substantial difference between DataFrame and a DataSet that would help my understanding to solve this problem? – stackoverflowuser2010 May 01 '17 at 18:52
  • @stackoverflowuser2010 DataFrame is indeed an alias specifically for `Dataset[Row]` - Row is a _generic row object_ NOT a custom class object (such as SensorData). That excellent book you cited is written by the very guy who answered your question ;). I can't find the image I want showing the RDD/DF/DS run/compile time breakdowns, but runner up: http://i.stack.imgur.com/3rF6p.png http://stackoverflow.com/questions/35424854/what-is-the-difference-between-spark-dataset-and-rdd http://stackoverflow.com/questions/34654145/how-to-convert-dataframe-to-dataset-in-apache-spark-in-java – Garren S May 02 '17 at 00:38