The dataset I have is riddled with nested fields. For instance, the output of data.take(1)
gives 9 columns, in which the 4th column (c4) has 3 sub-fields, the 1st sub-field of c4 has 3 sub-fields of its own, and so on.
The format looks roughly like this:
[A,B,C,[[d1,d2,d3],D2,D3],E,[F1,[f1,[f21,f22,f23],f3,f4],F3,F4],G,H,I]
I would like an array-of-arrays data structure (which can then be unrolled into a single flat array).
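To make the goal concrete, here is a rough sketch with plain Python lists standing in for whatever Spark actually returns (the letters are just placeholders for the real field values):

```python
# Input, schematically: one nested record, shaped like the format above
nested = ["A", "B", "C",
          [["d1", "d2", "d3"], "D2", "D3"],
          "E",
          ["F1", ["f1", ["f21", "f22", "f23"], "f3", "f4"], "F3", "F4"],
          "G", "H", "I"]

# Desired output: the same leaf values unrolled into a single flat array
flat = ["A", "B", "C",
        "d1", "d2", "d3", "D2", "D3",
        "E",
        "F1", "f1", "f21", "f22", "f23", "f3", "f4", "F3", "F4",
        "G", "H", "I"]
```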
Just to make the data look clearer:
A
B
C
D
  - D1
    - d1
    - d2
    - d3
  - D2
  - D3
E
F
  - F1
  - F2
    - f1
    - f2
      - f21
      - f22
      - f23
    - f3
    - f4
  - F3
  - F4
G
H
I
Of course, I could write a parsing program that recursively searches for sub-fields in a record and generates this tree structure (as an array of arrays); a rough sketch of what I mean follows below. However, I'm hoping there is a simpler and more efficient pre-built routine in Spark that handles this in a straightforward manner.
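This is roughly the kind of recursive routine I have in mind and would rather not hand-roll. It assumes each record comes back from data.take(1) as nested Python lists/tuples (pyspark.sql.Row is a tuple subclass, so the tuple check covers nested Rows as well):

```python
def flatten(field):
    """Recursively unroll a nested record into a flat list of leaf values."""
    if isinstance(field, (list, tuple)):
        flat = []
        for sub in field:
            flat.extend(flatten(sub))
        return flat
    return [field]

record = data.take(1)[0]   # one nested record from the dataset above
leaves = flatten(record)   # single flat list of leaf values
```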
Any answer in either Spark-Scala or PySpark would be appreciated.