4

I'm struggling to flatten a JSON schema that has structs inside structs. The problem is that the inner structs names are dynamic so I can't access them easily using "." notion

The schema is similar to :

    root
 |-- A: string (nullable = true)
 |-- Plugins: struct (nullable = true)
 |    |-- RfS: struct (nullable = true)
 |        |-- A
 |        |-- B
 |    |-- RtW: struct (nullable = true)
 |        |-- A
 |        |-- B

so As and Bs are fixed, but every JSON file has different number structs with different names (RfS,RtW) .. could be 2 .. could be 5 .. with dyanmic names that I don't know.

How can I flatten this structure easily in a dynamic way ?

Morano88
  • 2,047
  • 4
  • 25
  • 44
  • could u provide required output in printSchema() or tabular form? one way to go about this could be to first identify ur structs and fields df.schema and then use `struct.*` to dynamically select all.. – murtihash Jun 15 '20 at 00:54
  • @Morano88, can you please let us know, how you created such dynamic column with sample data? – Som Jun 15 '20 at 03:24

1 Answers1

2

The next solution is using a single select and chain function for flattening the final columns:

from pyspark.sql.functions import col
from itertools import chain

jsonData = """{
  "A" : "some A",
  "Plugins": {
    "RfS": {
      "A" : "RfSA",
      "B" : "RfSB"
    },
    "RtW" : {
      "A" : "RtWA",
      "B" : "RtWA"
    }
  }
}"""

df = spark.read.json(sc.parallelize([jsonData]))

no_plug_cols = ["A"] # cols not in Plugins i.e A
plug_df = df.select("A", "Plugins.*")

# plug_df.printSchema()
# root
#  |-- A: string (nullable = true)
#  |-- RfS: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)
#  |-- RtW: struct (nullable = true)
#  |    |-- A: string (nullable = true)
#  |    |-- B: string (nullable = true)

# note that we use sets i.e set(plug_df.columns) - set(no_plug_cols) to retrieve cols in Plugins only
icols = [(col(f"{c}.A").alias(f"{c}.A"), col(f"{c}.B").alias(f"{c}.B")) 
         for c in (set(plug_df.columns) - set(no_plug_cols))]

# we use chain to flatten icols which is a list of tuples
plug_df.select(no_plug_cols + list(chain(*icols))).show()

# +------+-----+-----+-----+-----+
# |     A|RfS.A|RfS.B|RtW.A|RtW.B|
# +------+-----+-----+-----+-----+
# |some A| RfSA| RfSB| RtWA| RtWA|
# +------+-----+-----+-----+-----+
abiratsis
  • 7,051
  • 3
  • 28
  • 46