
The dataset I have is riddled with nested fields. For instance, the output of `data.take(1)` shows 9 columns, in which the 4th column (c4) has 3 sub-fields, the 1st sub-field of c4 has 3 sub-fields of its own, and so on.

The format looks a bit like so

[A,B,C,[[d1,d2,d3],D2,D3],E,[F1,[f1,[f21,f22,f23],f3,f4],F3,F4],G,H,I]

I would like an array of array data structure (which can be then unrolled to a single array).

Just to make the data look clearer:

A
B
C
D
  -D1
    -d1
    -d2
    -d3
  -D2
  -D3
E
F
  -F1
  -F2
    -f1
    -f2
      -f21
      -f22
      -f23
    -f3
    -f4
  -F3
  -F4
G
H
I

Of course, I could write a parsing program that recursively searches for sub-fields in a given record and generates this tree structure (as an array of arrays). However, I'm hoping there is a simpler and more efficient pre-built routine in Spark that handles this in a straightforward manner.

Any answer in either Spark-Scala or PySpark would be appreciated.

Ébe Isaac
  • So, you basically want to know the schema of your dataset/dataframe? – addmeaning Jul 16 '18 at 11:10
  • @addmeaning Yes, yes exactly. In fact, I have the schema as an AVSC file, but I'm not sure how to map the schema to these records as the data is not readily in a JSON format. – Ébe Isaac Jul 16 '18 at 11:14

1 Answer


Well, there are two methods that can help you if you have already loaded your dataframe/dataset into memory using Spark. Try `df.printSchema()` and `df.schema`, where `df` is a variable referencing your dataset. The first prints the schema to the screen along with the data types; the second returns a `StructType` object that can be traversed by your code.
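For example, traversing the returned `StructType` recursively to collect all nested field paths might look like the sketch below. Note that `flattenSchema` is a hypothetical helper of my own, not a built-in Spark routine:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: flatten a (possibly nested) StructType into dotted field paths.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenSchema(inner, prefix + name + ".") // recurse into the nested struct
    case StructField(name, _, _, _) =>
      Seq(prefix + name)                        // leaf column: emit its full path
  }

// A schema mirroring the nesting from the example below:
val schema = StructType(Seq(
  StructField("someField", StringType),
  StructField("nestedLevel", StructType(Seq(
    StructField("key", IntegerType),
    StructField("value", StringType)
  )))
))

// flattenSchema(schema) gives Seq("someField", "nestedLevel.key", "nestedLevel.value")
```

The same walk can be adapted to build whatever array-of-arrays shape you need, since each `StructField` exposes its name, data type, and nullability.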

Update:

So now you want to conveniently select a nested field.

Imagine you have the following code:

import org.apache.spark.sql.SparkSession

object Question51360175 extends App {
  val session = SparkSession.builder()
    .appName("spark-app").master("local[*]").getOrCreate()

  import session.implicits._

  case class TopLevel(someField: String, nestedLevel: NestedLevel)
  case class NestedLevel(key: Int, value: String)

  val df = Seq(
    TopLevel("first", NestedLevel(1, "Onnu")),
    TopLevel("second", NestedLevel(2, "Rendu"))
  ).toDF

  df.printSchema()
  df.show()
}

That will give you this kind of output:

root
 |-- someField: string (nullable = true)
 |-- nestedLevel: struct (nullable = true)
 |    |-- key: integer (nullable = false)
 |    |-- value: string (nullable = true)

+---------+-----------+
|someField|nestedLevel|
+---------+-----------+
|    first|  [1, Onnu]|
|   second| [2, Rendu]|
+---------+-----------+

Now, if you want to get a nested column, for example only the `value` column from the nested class `NestedLevel`, you can just write `df.select("nestedLevel.value").show()`, which will return:

+-----+
|value|
+-----+
| Onnu|
|Rendu|
+-----+
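As for navigating individual rows: `row(n)` returns `Any`, so `row(n)(m)` won't compile; you first need to cast the nested value to a `Row`, e.g. with `getStruct` or `getAs[Row]`. A rough sketch, using a hand-built `Row` with the same shape as `df.first()` from the example above:

```scala
import org.apache.spark.sql.Row

// Hand-built Row shaped like df.first() above: [first, [1, Onnu]]
val row: Row = Row("first", Row(1, "Onnu"))

val nested: Row = row.getStruct(1)      // cast column 1 to its nested Row
val value: String = nested.getString(1) // "Onnu"

// Equivalent, by field name, on a row taken from a real DataFrame
// (requires the row to carry its schema):
// row.getAs[Row]("nestedLevel").getAs[String]("value")
```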
addmeaning
  • Thanks, this is exactly what I wanted. The schema aligns well with the AVSC file I have. However, I'm not sure how to navigate through it. E.g., `row(n)` gives the nth column, but `row(n)(m)` throws an error saying `Any does not take parameters`. How do I manipulate the sub-fields? – Ébe Isaac Jul 16 '18 at 11:35
  • I will update my answer. – addmeaning Jul 16 '18 at 11:43
  • Haha, just came here and saw the values holding 'Onnu, Rendu' in Tamil. Good to see :-) @addmeaning – Jay Vignesh Jul 16 '18 at 13:47
  • I have a similar question [here](https://stackoverflow.com/q/51374630/4565943) but it is about traversing a schema. Could you please take a look at it? BTW, nice play with the Tamil numbers. – Ébe Isaac Jul 17 '18 at 06:35