0

I'm loading the list of JSON files from a folder where each file ends with a number, so I am using the wild card to load all files at once.

raw_assignments_2 = spark.read.option("multiline","true").option(schema=schema).json("Assignments_*.json")

I am missing one key/value in some files and the spark is ignoring those files while reading the data into DF.

For example,

My file 1 contains the below keys and their values

[{ "id": 8731,
  "resource_type":"assignment",
  "assignee_id":2478
  "status":"complete"}]

My file 2 contains only three keys

[{ "id": 8731,
  "resource_type":"assignment",
  "assignee_id":2478}]

id,resource_type,assignee_id are mandatory fields that I expect from every JSON file where status is not mandatory. How can I fetch status into dataframe and assign a null value when there is no key in JSON file

noufel13
  • 653
  • 4
  • 4
Rahul
  • 467
  • 1
  • 8
  • 24
  • Try to format the input in this way https://stackoverflow.com/questions/38895057/reading-json-with-apache-spark-corrupt-record – Marco Massetti Feb 12 '21 at 23:53

1 Answers1

2

In the schema definition you can set the parameter nullable=True then the JSON will be loaded and the missing field set to null. If you would like a dedicated status flag you can afterwards filter for nulls on the columns that are mandatory.

customschema=StructType([
    StructField("id",DoubleType(), nullable=True),
    StructField("resource_type",StringType(), True),
    StructField("assignee_id",DoubleType(), True),
    StructField("status",StringType(), True)
])

raw_assignments_2 = spark.read.option("multiline","true").schema(customschema).json("data*.json")
raw_assignments_2.show()

raw_assignments_2.withColumn("Issue",f.when(f.col("status").isNull(),True).otherwise(False)).show()


+------+-------------+-----------+--------+
|    id|resource_type|assignee_id|  status|
+------+-------------+-----------+--------+
|8731.0|   assignment|     2478.0|complete|
|8731.0|   assignment|     2478.0|    null|
+------+-------------+-----------+--------+

+------+-------------+-----------+--------+-----+
|    id|resource_type|assignee_id|  status|Issue|
+------+-------------+-----------+--------+-----+
|8731.0|   assignment|     2478.0|complete|false|
|8731.0|   assignment|     2478.0|    null| true|
+------+-------------+-----------+--------+-----+

Disadvantage: it will also flag entries that are present in the JSON but have the value null

Alex Ortner
  • 1,097
  • 8
  • 24