For those who love examples, here is one:
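The snippets below assume a SparkSession called spark with its implicits in scope; here is a minimal setup sketch (the app name and local master are placeholders, not from the original post):

// minimal setup assumed by the snippets below
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataset-vs-dataframe-demo")  // placeholder app name
  .master("local[*]")                    // local mode, just for experimenting
  .getOrCreate()

import spark.implicits._  // brings toDF() / toDS() and the encoders into scope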
// create sample employee data
case class Employ(name: String, age: Int, id: Int, department: String)
val empData = Seq(Employ("A", 24, 132, "HR"), Employ("B", 26, 131, "Engineering"), Employ("C", 25, 135, "Data Science"))
// create a DataFrame and a Dataset from the same data
val empRDD = spark.sparkContext.makeRDD(empData)
val empDataFrame = empRDD.toDF()
val empDataset = empRDD.toDS()
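At this point both variables hold the same rows; the difference is only in their static types (in Spark 2.x and later, DataFrame is simply an alias for Dataset[Row]). A quick way to see it, as a small sketch:

// empDataFrame : org.apache.spark.sql.DataFrame        -- an alias for Dataset[Row]
// empDataset   : org.apache.spark.sql.Dataset[Employ]  -- keeps the Employ type around
empDataFrame.printSchema()  // schema inferred by reflection from the Employ case class
empDataset.printSchema()    // same schema, but each element is an Employ, not a generic Row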
Let's perform the same operation on both:
Dataset
val empDatasetResult = empDataset.filter(employ => employ.age > 24)
DataFrame
val empDataFrameResult = empDataFrame.filter(employ => employ.age > 24)
// compile error: "value age is not a member of org.apache.spark.sql.Row"
In the case of a DataFrame, the lambda receives a generic Row object rather than an Employ, so you cannot write employ.age > 24 directly. You have to pull the field out by name and type instead:
val empDataFrameResult = empDataFrame.filter(employ => employ.getAs[Int]("age") > 24)
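That getAs call is exactly where type safety is lost: the column name and the target type are only checked at runtime. A small sketch of the contrast (the wrong types are deliberate, to show the failure modes; these variables are our own illustration):

// DataFrame: reading the Int column "age" as a String compiles fine,
// but any action over badDataFrameFilter (e.g. .count()) fails at runtime
// with a ClassCastException
val badDataFrameFilter = empDataFrame.filter(employ => employ.getAs[String]("age").nonEmpty)

// Dataset: the analogous mistake never compiles, because the compiler
// knows employ.age is an Int, not a String
// val badDatasetFilter = empDataset.filter(employ => employ.age.nonEmpty)
// compile error: "value nonEmpty is not a member of Int"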
Why is the Dataset so special, then?
- Less development labor: you don't have to spell out a field's data type (for example getAs[Int]("age")) when performing an operation, because the compiler already knows it from the Employ case class, and type mistakes are caught at compile time instead of at runtime.
Nobody likes boilerplate code, and Datasets let us drop it; see the sketch below.
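A short sketch of what that looks like in practice; the chained transformations are our own illustration, not part of the original post:

// everything below is checked against the Employ case class at compile time:
// no Row objects, no column-name strings, no getAs casts
val seniorEngineerNames = empDataset
  .filter(employ => employ.age > 24)                     // Dataset[Employ]
  .filter(employ => employ.department == "Engineering")  // Dataset[Employ]
  .map(employ => employ.name)                            // Dataset[String]

seniorEngineerNames.show()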
Thanks to: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/