
I have multiple JSON files with a structure similar to these:

{
  "fields": [
    {
      "a": 1,
      "b": "Mike",
      "c": "Jordan"
    },
    {
      "a": 2,
      "b": "Filip",
      "c": "White"
    }
  ]
}

{
  "fields": {
    "a": 2,
    "b": "Mark",
    "c": "Brown"
  }
}

which I load into the same DataFrame:

df = spark.read.option("multiLine", True).json("/path/to/jsons")

I need to extract only the value of "b" when "a" = 2 (in this case Filip and Mark). I have two problems:

  • "fields" contains two different types (Struct and Array(Struct)) in the same column
  • how to extract only the field "b"

I'm using PySpark.
Thanks in advance.

df.withColumn("b values", col("fields") ... ??)
Tom C
  • try `df.withColumn("b values", col("fields.b"))` – Raphael Roth Jan 26 '20 at 13:55
  • @RaphaelRoth this will return me an array of all b fields, including those where a != 2, won't it? – Tom C Jan 26 '20 at 14:37
  • Does this answer your question? [Querying Spark SQL DataFrame with complex types](https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – user10938362 Jan 26 '20 at 14:52
  • there is no way to handle both schemas with one dataset/dataframe. You need to separate the two file types somehow or just transform everything into the array type – abiratsis Jan 26 '20 at 15:12

1 Answer


Handle them separately and then union the results:

from pyspark.sql.functions import explode

# Read each file (each schema) into its own DataFrame
df = spark.read.options(multiLine=True).json('/FileStore/tables/fields.json')
df1 = spark.read.options(multiLine=True).json('/FileStore/tables/fields2.json')

# Array(Struct): explode the array, then flatten the struct columns
df = df.select(explode(df['fields']))
df = df.select('col.a', 'col.b', 'col.c')

# Struct: flatten the struct columns directly
df1 = df1.select('fields.a', 'fields.b', 'fields.c')

# Union the two flattened DataFrames and filter
DF = df.union(df1)
DF.where(DF['a'] == 2).select(DF['b']).show()

+-----+
|    b|
+-----+
|Filip|
| Mark|
+-----+
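Outside Spark, the same normalize-then-filter idea can be sketched in plain Python; this is just an illustration of the logic (the `b_values` helper is a name invented here, not part of any API), assuming the two JSON shapes from the question:

```python
import json

# Two sample documents with the shapes from the question:
# "fields" is an array of structs in one file and a single struct in the other.
doc1 = json.loads("""
{"fields": [{"a": 1, "b": "Mike", "c": "Jordan"},
            {"a": 2, "b": "Filip", "c": "White"}]}
""")
doc2 = json.loads('{"fields": {"a": 2, "b": "Mark", "c": "Brown"}}')

def b_values(doc, a=2):
    """Normalize "fields" to a list, then keep "b" where "a" matches."""
    fields = doc["fields"]
    if isinstance(fields, dict):  # single struct -> wrap in a one-element list
        fields = [fields]
    return [row["b"] for row in fields if row["a"] == a]

print(b_values(doc1) + b_values(doc2))  # ['Filip', 'Mark']
```

The wrap-in-a-list step mirrors what the Spark answer achieves by exploding the array variant and unioning it with the struct variant: both shapes end up as rows with the same columns before filtering.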

Ravi