
I am trying to understand the difference between a Dataset and a DataFrame, and I found the following helpful link, but I am not able to understand what is meant by "type safe":

Difference between DataFrame (in Spark 2.0 i.e DataSet[Row] ) and RDD in Spark

hari haran

3 Answers


RDDs and Datasets are type safe, which means that the compiler knows the columns and their data types, i.e. whether a column is a Long, a String, etc.

But with a DataFrame, every time you call an action, collect() for instance, the result comes back as an Array of Rows, not as values with their Long, String, etc. types. In a DataFrame, columns do have their own types, such as Integer or String, but those types are not exposed to you; to you, every value is effectively of type Any. To convert a value in a Row to its proper type, you have to use the asInstanceOf method.

eg: In Scala:

scala > :type df.collect()
Array[org.apache.spark.sql.Row]


df.collect().map { row =>
  // these casts are checked only at runtime, not by the compiler
  val str = row(0).asInstanceOf[String]
  val num = row(1).asInstanceOf[Long]
  (str, num)
}
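To see why asInstanceOf is fragile, here is a plain-Scala sketch (no Spark needed) of the same problem: the Array[Any] below merely stands in for an untyped Row, it is not Spark's actual Row API. The compiler accepts any cast; a wrong one only fails at runtime.

```scala
// An Array[Any] standing in for an untyped Row: the compiler accepts
// every asInstanceOf cast, and a wrong one surfaces only at runtime
// as a ClassCastException.
val row: Array[Any] = Array("Alice", 42L)

val str = row(0).asInstanceOf[String] // ok: the value really is a String
val num = row(1).asInstanceOf[Long]   // ok: the value really is a Long

// This also compiles, but blows up when executed:
val bad = try {
  Some(row(0).asInstanceOf[Long]) // ClassCastException at runtime
} catch {
  case _: ClassCastException => None
}
// bad is None
```

With a typed Dataset, the equivalent mistake would be rejected at compile time instead.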
Phung Duy Phong
Saman

For people who love examples, here is one:

  1. Create sample employee data:

 case class Employ(name: String, age: Int, id: Int, department: String)

 val empData = Seq(Employ("A", 24, 132, "HR"), Employ("B", 26, 131, "Engineering"), Employ("C", 25, 135, "Data Science"))

  2. Create a DataFrame and a Dataset from it:

    import spark.implicits._
    val empRDD = spark.sparkContext.makeRDD(empData)
    val empDataFrame = empRDD.toDF()
    val empDataset = empRDD.toDS()


Let's perform an operation:

Dataset

val empDatasetResult = empDataset.filter(employ => employ.age > 24)

Dataframe

    val empDataFrameResult = empDataFrame.filter(employ => employ.age > 24)

// throws error: "value age is not a member of org.apache.spark.sql.Row"

In the case of a DataFrame, the lambda receives a Row object, not an Employ object, so you can't write employ.age > 24 directly. But you can do the following:

val empDataFrameResult = empDataFrame.filter(employ => employ.getAs[Int]("age") > 24)
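Alternatively, a DataFrame can be converted back into a typed Dataset with .as[Employ], after which the lambda is checked by the compiler again. A self-contained sketch, assuming Spark is on the classpath and a local master is acceptable (the data mirrors the example above):

```scala
import org.apache.spark.sql.SparkSession

case class Employ(name: String, age: Int, id: Int, department: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val empDataFrame = Seq(
  Employ("A", 24, 132, "HR"),
  Employ("B", 26, 131, "Engineering"),
  Employ("C", 25, 135, "Data Science")
).toDF()

// .as[Employ] reattaches the static type, so the compiler checks `.age`;
// a typo such as `_.agee` would fail at compile time, unlike
// getAs[Int]("agee"), which would only fail at runtime.
val empTyped = empDataFrame.as[Employ]
val empDataFrameResult = empTyped.filter(_.age > 24)
```

This is often the most convenient middle ground: read data untyped, then pin it to a case class once at the boundary.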

Why is the dataset so special then?

  • Less development labor: you don't need to look up or cast the data type of a column when performing an operation.

Who likes boilerplate code? Let's get rid of it by using Datasets.

Thanks to :https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/

Neethu Lalitha

The type-safe API is an advanced API that came with Datasets in Spark 2.0.

We need this API to do more complex operations on the rows of a Dataset.

e.g.:

departments.joinWith(people, departments("id") === people("deptId"), "left_outer").show
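Unlike a plain join, joinWith keeps both sides typed: it returns a Dataset of pairs rather than a flattened Row. The answer does not show the inputs, so the Department and Person case classes and sample rows below are assumptions made to produce a self-contained sketch (local SparkSession assumed):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Department(id: Int, name: String)
case class Person(name: String, deptId: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val departments = Seq(Department(1, "HR"), Department(2, "Engineering")).toDS()
val people      = Seq(Person("A", 1), Person("B", 2)).toDS()

// joinWith returns Dataset[(Department, Person)], so each side keeps its
// own case-class type instead of collapsing into an untyped Row.
val joined: Dataset[(Department, Person)] =
  departments.joinWith(people, departments("id") === people("deptId"), "left_outer")

joined.show()
```

Note the join key and join type are still passed as column expressions and strings, so only the *result* is statically typed, not the join condition itself.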
riskypenguin
  • your example proves Dataset is not type safe: you just used strings in your join: "deptId", "left_outer" etc – Adrian Dec 07 '21 at 08:15