I am new to Spark with Scala. Could someone help clear up the confusion below?
Question 1: When a Spark DataFrame contains a struct column, a Spark UDF over it often takes an input argument of type Row or Seq[Row].
a. What is the difference between Row and StructType?
b. Why can't the Spark UDF take an input of type Seq[StructType]?
c. Seq is a Scala datatype, while Row is a Spark datatype. Why does the UDF mix these two datatypes?
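A minimal sketch of the distinction, assuming a local SparkSession named spark (my own setup, not from your code): StructType only *describes* the schema of a struct column, while Row is the *runtime value* your UDF actually receives for each struct. Note the untyped udf(f, dataType) overload used here is the classic Spark 2.x pattern for Row inputs; Spark 3 requires the spark.sql.legacy.allowUntypedScalaUDF flag to allow it.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().master("local[*]").appName("row-vs-structtype").getOrCreate()
import spark.implicits._
// Needed on Spark 3.x for the untyped udf(f, dataType) overload below
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")

// A struct column: its *schema* is a StructType, its *values* are Rows.
val people = Seq(("James", "Smith"), ("Jen", "Brown"))
  .toDF("first", "last")
  .select(struct(col("first"), col("last")).as("name"))

// At runtime the UDF receives each struct value as a Row, never as a
// StructType -- the StructType lives only in people.schema.
val fullName = udf((name: Row) => s"${name.getString(0)} ${name.getString(1)}", StringType)

people.select(fullName(col("name")).as("full")).show()
```

This is also why a UDF cannot take Seq[StructType]: a StructType is metadata (a description of field names and types), not a value that exists per record.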
Question 2: When creating a DataFrame, why does simpleData mix the Scala datatype Seq with the Spark datatype Row? Could it instead be Seq(StructType("James ","","Smith","36636","M",3000), ...)?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val simpleData = Seq(
  Row("James ", "", "Smith", "36636", "M", 3000),
  Row("Michael ", "Rose", "", "40288", "M", 4000),
  Row("Robert ", "", "Williams", "42114", "M", 4000),
  Row("Maria ", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "", "F", -1)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(simpleData), simpleSchema)
df.printSchema()
df.show()
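For what it's worth, Seq(StructType(...)) would not even compile: StructType's constructor takes StructFields, not data values. If the goal is to avoid writing Row at all, a common alternative is to let Spark derive the StructType from a case class. A sketch, using a hypothetical Person class of my own naming:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring simpleSchema; Spark derives the
// schema (a StructType) from its fields, so no explicit Row is needed.
case class Person(firstname: String, middlename: String, lastname: String,
                  id: String, gender: String, salary: Int)

val spark = SparkSession.builder().master("local[*]").appName("case-class-df").getOrCreate()
import spark.implicits._

val df2 = Seq(
  Person("James ", "", "Smith", "36636", "M", 3000),
  Person("Jen", "Mary", "Brown", "", "F", -1)
).toDF()

df2.printSchema()   // same six fields as simpleSchema, inferred from Person
```

One difference from the explicit-schema version: the derived salary field is non-nullable, since Scala's Int cannot hold null.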
Follow up:
In the Spark SQL data types reference (https://spark.apache.org/docs/latest/sql-ref-datatypes.html) I see both a "Data type" column and a "Value type in Scala" column: the data type is StructType, while the value type in Scala is org.apache.spark.sql.Row. What is the difference between a data type and a value type?
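My reading of that table, as a sketch (self-contained, single-column example of my own): each "Data type" is how Spark SQL describes a column internally in the schema, and the paired "Value type in Scala" is the JVM type your Scala code sends in or gets back for that column. So StructType pairs with Row the same way IntegerType pairs with Int:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("datatype-vs-valuetype").getOrCreate()

val schema = StructType(Array(StructField("salary", IntegerType, true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(3000))), schema)

// "Data type": how Spark SQL describes the column in the schema
println(df.schema("salary").dataType)        // IntegerType

// "Value type in Scala": what your Scala code receives back
val rows: Array[Row] = df.collect()          // the whole record is a Row
val salary: Int = rows(0).getAs[Int]("salary")  // IntegerType <-> Int
```

That is also why createDataFrame mixes the two worlds: simpleData holds the Scala-side values (Rows inside a Seq), while simpleSchema holds the Spark-side description (a StructType of StructFields).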