
After reading some technical articles, I have seen it claimed that a DataFrame only knows the names of its columns but not their types. However, when I call the `printSchema` function on a DataFrame myself, both the column names and their types are printed. I am doubtful about this claim and am looking forward to your answer.

example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

case class Person(name:String,age:Long)
object DS {
  def main(args: Array[String]): Unit = {
    val config = new SparkConf().setAppName("sparkSql").setMaster("local[*]")
    val sc = new SparkContext(config)
    val spark = SparkSession.builder().config(config).getOrCreate()
    val seq = Seq(("aa",1),("bb",2))
    import spark.implicits._
    val rdd = sc.makeRDD(seq)
    // toDF gets only column names, so Spark infers the types from the data (age becomes int)
    val df = rdd.toDF("name","age")
    // the case class carries the types, so the Dataset keeps age as Long
    val ds = rdd.map(line => Person(line._1, line._2)).toDS()

    println("dataframe schema:")
    df.printSchema()
/*
    dataframe schema:
    root
     |-- name: string (nullable = true)
     |-- age: integer (nullable = true)
*/
    println("dataset schema:")
    ds.printSchema()
/*
    dataset schema:
    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = true)
*/
  }
}


In this example, the age column in the DataFrame schema is integer, the age column in the Dataset schema is long, and the age field of the case class Person is Long.

Wayne
    Possible duplicate of [Difference between DataSet API and DataFrame API](https://stackoverflow.com/questions/37301226/difference-between-dataset-api-and-dataframe-api) – philantrovert Jan 16 '19 at 09:11

2 Answers


It depends on what type of file you are reading.

If it is a CSV file without a header, then you need to provide the column names and data types yourself using a schema.

If it is a CSV file with a header, then you can pass "inferSchema" -> "true" as an option while reading the file. This option infers the schema and data types automatically. However, the data types are derived from the first few records of the actual data.

val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath)

If, for any reason, the first few records of a column hold integer values and later records hold strings, you will run into issues; hence it is always a best practice to provide the schema explicitly, as sketched below.
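For example, a minimal sketch of providing the schema explicitly (the file path, delimiter and nullability here are just placeholders, not something from your question):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// hypothetical schema for a name/age file; adjust names and types to your data
val personSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true)
))

// header = false because this sketch assumes the file has no header row
val dfWithSchema = spark.read
  .option("header", "false")
  .option("delimiter", "|")
  .schema(personSchema)
  .csv(filePath)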

Your code is working as expected.

The statement below automatically infers the data type Int for age, based on the data Seq(("aa",1),("bb",2)):

val df = rdd.toDF("name","age")

However, when you convert the RDD to a Dataset

val ds = rdd.map(line =>{Person(line._1,line._2)}).toDS()

Here, you are converting to Person, which declares the "age" field as Long; hence you see it as long, as expected. Note that the automatic conversion from Int to Long is done by Scala (numeric widening, an upcast), not by Spark.
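To see that this widening happens on the Scala side, here is a tiny sketch (plain Scala, no Spark involved; it reuses the Person case class from your example):

// numeric widening: an Int is accepted wherever a Long is expected
val i: Int = 1
val l: Long = i                 // Int widened to Long at compile time, no cast needed
val p = Person("aa", i)         // same widening when calling Person(name: String, age: Long)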

Hope this clarifies!

The link below is a good read on how to provide a complex schema; hopefully it gives you more ideas.

https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803

Thanks

NNK

In the first example, where you use rdd.toDF("name", "age"), you do not explicitly provide a schema for the DataFrame. And a DataFrame is really just a Dataset[Row]. Hence, Spark picks the best possible data type based on the data (int, based on the values 1 and 2).

In the second example, you create a Dataset, which preserves the data types declared in the case class. So:

val ds = rdd.map(line => Person(line._1,line._2) ).toDS()

creates a Dataset[Person], which keeps the specified schema intact.
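If it helps, here is a small sketch of what that difference means in practice, reusing the df and ds from your example (only a sketch, with the types as inferred above):

// DataFrame (= Dataset[Row]): the column type is only known at runtime,
// so you have to ask for the right type yourself (getInt, because age was inferred as int)
val firstAge = df.select("age").first().getInt(0)

// Dataset[Person]: fields and their types are checked at compile time,
// so misspelling .age or treating it as a String would not compile
val agesPlusOne = ds.map(_.age + 1)
agesPlusOne.show()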

philantrovert