I am new to Spark with Scala. Could someone help clear up the confusion below?
Question 1: When a Spark DataFrame contains a struct column, a Spark UDF over it often takes an input argument of type Row or Seq[Row].
a. What is the difference between Row and StructType?
b. Why can't the Spark UDF take an input of type Seq[StructType]?
c. Seq is a Scala datatype, while Row is a Spark datatype. Why does the UDF mix these two datatypes?
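A minimal sketch of the distinction, assuming a local SparkSession named spark (my own setup, not from your code): StructType only *describes* the schema of a struct column, while Row is the *runtime value* your UDF actually receives for each struct. Note the untyped udf(f, dataType) overload used here is the classic Spark 2.x pattern for Row inputs; Spark 3 requires the spark.sql.legacy.allowUntypedScalaUDF flag to allow it.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().master("local[*]").appName("row-vs-structtype").getOrCreate()
import spark.implicits._
// Needed on Spark 3.x for the untyped udf(f, dataType) overload below
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")

// A struct column: its *schema* is a StructType, its *values* are Rows.
val people = Seq(("James", "Smith"), ("Jen", "Brown"))
  .toDF("first", "last")
  .select(struct(col("first"), col("last")).as("name"))

// At runtime the UDF receives each struct value as a Row, never as a
// StructType -- the StructType lives only in people.schema.
val fullName = udf((name: Row) => s"${name.getString(0)} ${name.getString(1)}", StringType)

people.select(fullName(col("name")).as("full")).show()
```

This is also why a UDF cannot take Seq[StructType]: a StructType is metadata (a description of field names and types), not a value that exists per record.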
Question 2: When creating a DataFrame, why does simpleData mix the Scala datatype Seq with the Spark datatype Row? Could it instead be Seq(StructType("James ","","Smith","36636","M",3000), ...)?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val simpleData = Seq(
  Row("James ", "", "Smith", "36636", "M", 3000),
  Row("Michael ", "Rose", "", "40288", "M", 4000),
  Row("Robert ", "", "Williams", "42114", "M", 4000),
  Row("Maria ", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "", "F", -1)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(simpleData), simpleSchema)
df.printSchema()
df.show()
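For what it's worth, Seq(StructType(...)) would not even compile: StructType's constructor takes StructFields, not data values. If the goal is to avoid writing Row at all, a common alternative is to let Spark derive the StructType from a case class. A sketch, using a hypothetical Person class of my own naming:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring simpleSchema; Spark derives the
// schema (a StructType) from its fields, so no explicit Row is needed.
case class Person(firstname: String, middlename: String, lastname: String,
                  id: String, gender: String, salary: Int)

val spark = SparkSession.builder().master("local[*]").appName("case-class-df").getOrCreate()
import spark.implicits._

val df2 = Seq(
  Person("James ", "", "Smith", "36636", "M", 3000),
  Person("Jen", "Mary", "Brown", "", "F", -1)
).toDF()

df2.printSchema()   // same six fields as simpleSchema, inferred from Person
```

One difference from the explicit-schema version: the derived salary field is non-nullable, since Scala's Int cannot hold null.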
Follow up:
In the Spark SQL data types reference (https://spark.apache.org/docs/latest/sql-ref-datatypes.html) I see both a "Data type" column and a "Value type in Scala" column: the data type is StructType, while the value type in Scala is org.apache.spark.sql.Row. What is the difference between a data type and a value type?
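My reading of that table, as a sketch (self-contained, single-column example of my own): each "Data type" is how Spark SQL describes a column internally in the schema, and the paired "Value type in Scala" is the JVM type your Scala code sends in or gets back for that column. So StructType pairs with Row the same way IntegerType pairs with Int:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("datatype-vs-valuetype").getOrCreate()

val schema = StructType(Array(StructField("salary", IntegerType, true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(3000))), schema)

// "Data type": how Spark SQL describes the column in the schema
println(df.schema("salary").dataType)        // IntegerType

// "Value type in Scala": what your Scala code receives back
val rows: Array[Row] = df.collect()          // the whole record is a Row
val salary: Int = rows(0).getAs[Int]("salary")  // IntegerType <-> Int
```

That is also why createDataFrame mixes the two worlds: simpleData holds the Scala-side values (Rows inside a Seq), while simpleSchema holds the Spark-side description (a StructType of StructFields).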