
I have multiple JSON files with a structure similar to these:

{
  "fields": [
    {
      "a": 1,
      "b": "Mike",
      "c": "Jordan"
    },
    {
      "a": 2,
      "b": "Filip",
      "c": "White"
    }
  ]
}

{
  "fields": {
    "a": 2,
    "b": "Mark",
    "c": "Brown"
  }
}

which I load into the same DataFrame:

df = spark.read.option("multiLine", True).json("/path/to/jsons")

I need to extract only the value of "b" when "a" = 2 (in this case Filip and Mark). I have two problems:

  • "fields" contains two different types (Struct and Array(Struct)) in the same column
  • how to extract only the field "b"

I'm using PySpark.
Thanks in advance.

df.withColumn("b values", col("fields") ... ??)
Tom C
  • try `df.withColumn("b values", col("fields.b"))` – Raphael Roth Jan 26 '20 at 13:55
  • @RaphaelRoth this will return me an array of all b fields, including those where a != 2, won't it? – Tom C Jan 26 '20 at 14:37
  • Does this answer your question? [Querying Spark SQL DataFrame with complex types](https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – user10938362 Jan 26 '20 at 14:52
  • there is no way to handle both schemas with one dataset/dataframe. You need to separate the two file types somehow or just transform everything into the array type – abiratsis Jan 26 '20 at 15:12

1 Answer


Handle them separately and then union the results:

from pyspark.sql.functions import explode

# Read each file (each schema) into its own DataFrame
df = spark.read.options(multiLine=True).json('/FileStore/tables/fields.json')
df1 = spark.read.options(multiLine=True).json('/FileStore/tables/fields2.json')

# Array(Struct): explode the array, then flatten the struct columns
df = df.select(explode(df['fields']))
df = df.select('col.a', 'col.b', 'col.c')

# Struct: flatten the struct columns directly
df1 = df1.select('fields.a', 'fields.b', 'fields.c')

# Union the two flattened DataFrames and filter
DF = df.union(df1)
DF.where(DF['a'] == 2).select(DF['b']).show()

+-----+
|    b|
+-----+
|Filip|
| Mark|
+-----+
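Outside Spark, the same normalize-then-filter idea can be sketched in plain Python; this is just an illustration of the logic (the `b_values` helper is a name invented here, not part of any API), assuming the two JSON shapes from the question:

```python
import json

# Two sample documents with the shapes from the question:
# "fields" is an array of structs in one file and a single struct in the other.
doc1 = json.loads("""
{"fields": [{"a": 1, "b": "Mike", "c": "Jordan"},
            {"a": 2, "b": "Filip", "c": "White"}]}
""")
doc2 = json.loads('{"fields": {"a": 2, "b": "Mark", "c": "Brown"}}')

def b_values(doc, a=2):
    """Normalize "fields" to a list, then keep "b" where "a" matches."""
    fields = doc["fields"]
    if isinstance(fields, dict):  # single struct -> wrap in a one-element list
        fields = [fields]
    return [row["b"] for row in fields if row["a"] == a]

print(b_values(doc1) + b_values(doc2))  # ['Filip', 'Mark']
```

The wrap-in-a-list step mirrors what the Spark answer achieves by exploding the array variant and unioning it with the struct variant: both shapes end up as rows with the same columns before filtering.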

Ravi