I'm reading a dataframe from parquet file, which has nested columns (struct). How can I check if nested columns are present?

It might be like this

+----------------------+
| column1              |
+----------------------+
|{a_id:[1], b_id:[1,2]}|
+----------------------+

or like this

+---------------------+
| column1             |
+---------------------+
|{a_id:[3,5]}         |
+---------------------+

I know how to check whether a top-level column is present, as answered in [How do I detect if a Spark DataFrame has a column](https://stackoverflow.com/questions/35904136/how-do-i-detect-if-a-spark-dataframe-has-a-column):

df.schema.fieldNames.contains("column_name")

But how can I check for a nested column?

statanly

  • You can use `.printSchema()` to analyze the inferred schema. Also you can convert to a typed `Dataset` by defining `case class myClass(...)` and using `.as[myClass]` to see if it converts successfully. – Travis Hegner Mar 14 '19 at 13:35
  • [this answer](https://stackoverflow.com/a/36332079) explains it. This is the most reliable method to check for nested columns. – shanmuga Mar 14 '19 at 14:07
  • Possible duplicate of [How do I detect if a Spark DataFrame has a column](https://stackoverflow.com/questions/35904136/how-do-i-detect-if-a-spark-dataframe-has-a-column) – 10465355 Mar 14 '19 at 14:33

1 Answer

You can get the schema of the nested field as a `StructType`, and then check whether your field is present among its field names:

import org.apache.spark.sql.types.StructType

val index = df.schema.fieldIndex("column1")
val isBIdPresent = df.schema(index).dataType.asInstanceOf[StructType]
                     .fieldNames.contains("b_id")
Viacheslav Shalamov