
I have always seen that, when we are using a map function, we can create a DataFrame from an RDD using a case class like below:

case class filematches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)

newrdd1.map(x => filematches(x._1, x._2, x._3, x._4)).toDF()

This works great, as we all know!!

I was wondering: why do we specifically need case classes here? We should be able to achieve the same effect using a normal class with a parameterized constructor (as the fields will be vals and not private):

class filematches1(
  val row_num: Long,
  val matches: Long,
  val non_matches: Long,
  val non_match_column_desc: Array[String]
)

newrdd1.map(x => new filematches1(x._1, x._2, x._3, x._4)).toDF

Here, I am using the new keyword to instantiate the class.

Running the above gives me the error:

error: value toDF is not a member of org.apache.spark.rdd.RDD[filematches1]

I am sure I am missing some key concept about case classes vs regular classes here, but I am not able to find it yet.

Mike
  • Possible duplicate of [How to store custom objects in Dataset?](https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset) – 10465355 Dec 06 '18 at 12:22

1 Answer


To resolve the error value toDF is not a member of org.apache.spark.rdd.RDD[...], you should move your case class definition out of the function where you are using it. You can refer to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache/td-p/29878 for more detail.
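For example, a minimal sketch of the fix (the SparkSession wiring and the sample data here are assumptions for illustration, not from the original post):

import org.apache.spark.sql.SparkSession

// Defined at the top level, not inside a method, so that Spark's
// reflection can find a TypeTag for it during schema inference.
case class FileMatches(
  row_num: Long,
  matches: Long,
  non_matches: Long,
  non_match_column_desc: Array[String]
)

object ToDfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("toDF-example").master("local[*]").getOrCreate()
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes

    // Hypothetical stand-in for the newrdd1 from the question
    val newrdd1 = spark.sparkContext.parallelize(Seq((1L, 2L, 0L, Array("desc"))))

    val df = newrdd1.map(x => FileMatches(x._1, x._2, x._3, x._4)).toDF()
    df.show()
    spark.stop()
  }
}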

On your other query: case classes are syntactic sugar, and they provide the following additional things.

Case classes are different from general classes. They are especially used when creating immutable objects.

  1. They have a default apply function in the companion object, which is used as a constructor to create objects (so less code).

  2. All the variables in a case class are val by default, hence immutable, which is a good thing in the Spark world, as all RDDs are immutable.

    An example of a case class:

    case class Book(name: String)
    val book1 = Book("test")

You cannot change the value of book1.name, as it is immutable, and you do not need to say new Book() to create the object here.

  3. The class variables are public by default, so you don't need setters and getters.

Moreover, when comparing two objects of case classes, their structure is compared instead of their references.
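To make these points concrete, here is a small sketch (Book and PlainBook are illustrative names, not from the original answer) contrasting a case class with a regular class:

// Case class: companion apply, val fields, and equals/hashCode all generated
case class Book(name: String)

// Regular class: you only get what you write yourself
class PlainBook(val name: String)

val b1 = Book("test")   // no 'new' needed: the companion apply is called
val b2 = Book("test")
println(b1 == b2)       // true: structural (field-by-field) equality

val p1 = new PlainBook("test")
val p2 = new PlainBook("test")
println(p1 == p2)       // false: reference equality inherited from AnyRef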

Edit: Spark uses the following class to infer the schema. Code link: https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

If you check the schemaFor function (lines 719 to 791), it converts Scala types to Catalyst types. I think the case to handle non-case classes for schema inference has not been added yet, so every time you try to use a non-case class with schema inference, it falls through to the default case and hence gives the error Schema for type $other is not supported.
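As a possible workaround (a sketch under the assumption that you want to keep the regular filematches1 class from the question), you can bypass schema inference entirely by mapping to Row objects and supplying an explicit StructType:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

class filematches1(
  val row_num: Long,
  val matches: Long,
  val non_matches: Long,
  val non_match_column_desc: Array[String]
)

val spark = SparkSession.builder().appName("explicit-schema").master("local[*]").getOrCreate()

// Hand-written schema instead of relying on ScalaReflection-based inference
val schema = StructType(Seq(
  StructField("row_num", LongType, nullable = false),
  StructField("matches", LongType, nullable = false),
  StructField("non_matches", LongType, nullable = false),
  StructField("non_match_column_desc", ArrayType(StringType), nullable = true)
))

// Hypothetical stand-in for the newrdd1 from the question
val newrdd1 = spark.sparkContext.parallelize(Seq((1L, 2L, 0L, Array("desc"))))

// Build Rows from the regular-class instances, then attach the schema
val rowRdd = newrdd1.map { x =>
  val fm = new filematches1(x._1, x._2, x._3, x._4)
  Row(fm.row_num, fm.matches, fm.non_matches, fm.non_match_column_desc)
}

val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()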

Hope this helps

Harjeet Kumar
  • Hi, thanks for the reply. But I think the question is still open for me. If I am declaring a general class with val variables (which means immutable) that are public by default, then can I achieve the same effect which a case class gives, as per your points #2 and #3 above? – Shubham Aggarwal Dec 06 '18 at 10:43
  • Hi Shubham, I looked into the Spark code and have added more explanation. Please suggest if you are fine with it – Harjeet Kumar Dec 06 '18 at 12:37
  • Thanks. I am a newbie to Scala, and hence it's hard for me to understand the code snippet you have mentioned. But I do take it as the solution that, right now, the option to infer a schema from a normal class is not there. – Shubham Aggarwal Dec 06 '18 at 14:39