
The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:

// sc is the SparkContext, while sqlContext is the SQLContext.

// Define the case class and raw data
case class Dog(name: String)
val data = Array(
    Dog("Rex"),
    Dog("Fido")
)

// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)

// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)

// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])

// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()

The output I'm seeing is:

Dog(Rex)
Dog(Fido)
++
||
++
||
||
++

What am I missing?

Thanks!

– sparkour

3 Answers


All you need is just

val dogDF = sqlContext.createDataFrame(dogRDD)

The two-argument overload is part of the Java API and expects your class to follow the JavaBeans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
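For comparison, the two-argument overload would work with a JavaBean-style class. The sketch below is illustrative (the `DogBean` class is hypothetical, not from the question); the Java API reflects on getter/setter pairs to infer columns:

```scala
// Hypothetical JavaBean-style class: getter/setter pairs are what the
// Java overload of createDataFrame introspects to find columns.
class DogBean extends Serializable {
  private var name: String = _
  def getName: String = name
  def setName(value: String): Unit = { name = value }
}

val rex = new DogBean
rex.setName("Rex")
val fido = new DogBean
fido.setName("Fido")

val beanRDD = sc.parallelize(Seq(rex, fido))
// With a bean class, the two-argument form should produce a "name" column.
val beanDF = sqlContext.createDataFrame(beanRDD, classOf[DogBean])
beanDF.show()
```

For a plain Scala case class, though, the single-argument `createDataFrame(dogRDD)` shown above remains the simpler fix.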

– Vitalii Kotliarenko
  • This worked. I also had to move the definition of the case class outside of my main function to avoid `error: No TypeTag available for Dog`. Thanks! – sparkour May 03 '16 at 13:06
  • I see, very interesting, so the second parameter is only ever required when calling from the Java API, scala will just automagically detect the fields of the Type that should be converted to columns? – qwwqwwq Dec 23 '16 at 18:31
  • It worked only if the `case class` is moved outside of `main`. @Vitalii, @sparkour, is there any explanation for why the `case class` needs to be moved outside of `main`? – Praveen L Apr 04 '18 at 06:54
  • I am getting "`abstract` is a reserved keyword and cannot be used as field name" because my case class has `abstract` as a field name. Any workaround for this? – Anish Nov 15 '18 at 01:56

You can create a DataFrame directly from a Seq of case class instances using toDF (after importing the SQLContext implicits) as follows:

import sqlContext.implicits._

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
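For context, the resulting DataFrame gets a single `name` column inferred from the case class fields. A quick check (a sketch, assuming `import sqlContext.implicits._` is in scope):

```scala
// Inspect the schema and contents inferred from the Dog case class.
// printSchema should report a single nullable string column "name".
dogDf.printSchema()
dogDf.show()
```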
– David Griffin

The case class approach won't work in cluster mode: it will throw a ClassNotFoundException for the case class you defined.

Instead, convert the data to an RDD[Row], define the schema of your RDD with StructType and StructField, and then call createDataFrame, like:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StringType, StructType}

val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }

val rddStruct = StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("pos", StringType, nullable = true)))

sqlContext.createDataFrame(rdd, rddStruct)

toDF() won't work either.

– Kamaldeep Singh