For those who love examples, here is one:
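The snippets below assume a SparkSession called spark with its implicits in scope; here is a minimal setup sketch (the app name and local master are placeholders, not from the original post):

// minimal setup assumed by the snippets below
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataset-vs-dataframe-demo")  // placeholder app name
  .master("local[*]")                    // local mode, just for experimenting
  .getOrCreate()

import spark.implicits._  // brings toDF() / toDS() and the encoders into scope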
// create sample employee data
case class Employ(name: String, age: Int, id: Int, department: String)
val empData = Seq(Employ("A", 24, 132, "HR"), Employ("B", 26, 131, "Engineering"), Employ("C", 25, 135, "Data Science"))
// create a DataFrame and a Dataset from the same data
val empRDD = spark.sparkContext.makeRDD(empData)
val empDataFrame = empRDD.toDF()
val empDataset = empRDD.toDS()
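At this point both variables hold the same rows; the difference is only in their static types (in Spark 2.x and later, DataFrame is simply an alias for Dataset[Row]). A quick way to see it, as a small sketch:

// empDataFrame : org.apache.spark.sql.DataFrame        -- an alias for Dataset[Row]
// empDataset   : org.apache.spark.sql.Dataset[Employ]  -- keeps the Employ type around
empDataFrame.printSchema()  // schema inferred by reflection from the Employ case class
empDataset.printSchema()    // same schema, but each element is an Employ, not a generic Row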
Let's perform the same operation on both:
Dataset
val empDatasetResult = empDataset.filter(employ => employ.age > 24)
DataFrame
val empDataFrameResult = empDataFrame.filter(employ => employ.age > 24)
// compile error: "value age is not a member of org.apache.spark.sql.Row"
In the case of a DataFrame, the lambda receives a generic Row object rather than an Employ, so you cannot write employ.age > 24 directly. You have to pull the field out by name and type instead:
val empDataFrameResult = empDataFrame.filter(employ => employ.getAs[Int]("age") > 24)
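That getAs call is exactly where type safety is lost: the column name and the target type are only checked at runtime. A small sketch of the contrast (the wrong types are deliberate, to show the failure modes; these variables are our own illustration):

// DataFrame: reading the Int column "age" as a String compiles fine,
// but any action over badDataFrameFilter (e.g. .count()) fails at runtime
// with a ClassCastException
val badDataFrameFilter = empDataFrame.filter(employ => employ.getAs[String]("age").nonEmpty)

// Dataset: the analogous mistake never compiles, because the compiler
// knows employ.age is an Int, not a String
// val badDatasetFilter = empDataset.filter(employ => employ.age.nonEmpty)
// compile error: "value nonEmpty is not a member of Int"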
Why is the Dataset so special, then?
- Less development labor: you don't have to spell out a field's data type (for example getAs[Int]("age")) when performing an operation, because the compiler already knows it from the Employ case class, and type mistakes are caught at compile time instead of at runtime.
Nobody likes boilerplate code, and Datasets let us drop it; see the sketch below.
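A short sketch of what that looks like in practice; the chained transformations are our own illustration, not part of the original post:

// everything below is checked against the Employ case class at compile time:
// no Row objects, no column-name strings, no getAs casts
val seniorEngineerNames = empDataset
  .filter(employ => employ.age > 24)                     // Dataset[Employ]
  .filter(employ => employ.department == "Engineering")  // Dataset[Employ]
  .map(employ => employ.name)                            // Dataset[String]

seniorEngineerNames.show()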
Thanks to: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/