
I have a JSON file:

{
  "a": {
    "b": 1
  }
}

I am trying to read it:

val path = "D:/playground/input.json"
val df = spark.read.json(path)
df.show()

But I am getting an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example: spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count() and spark.read.schema(schema).json(file).select("_corrupt_record").show(). Instead, you can cache or save the parsed results and then send the same query. For example, val df = spark.read.schema(schema).json(file).cache() and then df.filter($"_corrupt_record".isNotNull).count().;

So I tried to cache it, as the message suggests:

val path = "D:/playground/input.json"
val df = spark.read.json(path).cache()
df.show()

But I keep getting the same error.

The error clearly says that the problem is that your **JSON** was not read properly. The reason is that **Spark** requires a specific format: _"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object."_ - [**documentation**](http://spark.apache.org/docs/latest/sql-data-sources-json.html) - Also, on the [**Scaladoc**](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) of the read method, you can see the `multiLine` option, which can be useful in this case. – Luis Miguel Mejía Suárez Aug 11 '19 at 16:31

Please check this [link](https://stackoverflow.com/questions/38545850/read-multiline-json-in-apache-spark) for some additional info and a solution. – kode Aug 11 '19 at 17:10

Painful issue, poorly explained imho. – thebluephantom Aug 11 '19 at 21:03

@LuisMiguelMejíaSuárez Thanks, I didn't know that. Now it's working. Please write your comment as an answer and I will accept it. – Alon Aug 11 '19 at 22:24

3 Answers


You may try either of these two ways.

Option-1: Put the JSON on a single line, as described in @Avishek Bhattacharya's answer below.

Option-2: Add the multiLine option when reading the JSON, as follows. You can also read the nested attribute, as shown below.

// The multiLine option (option names are case-insensitive) tells Spark
// that a single JSON document may span several lines.
val df = spark.read.option("multiLine", "true").json("C:\\data\\nested-data.json")
df.select("a.b").show()

Here is the output for Option-2.

20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
|  b|
+---+
|  1|
+---+
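
For reference, here is a minimal self-contained sketch of Option-2 as a standalone application, using the question's file path. The object name matches the NestedJsonReader.scala seen in the log above, but the appName and local master are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object NestedJsonReader {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; adjust master/appName for your environment.
    val spark = SparkSession.builder()
      .appName("NestedJsonReader")
      .master("local[*]")
      .getOrCreate()

    // multiLine lets Spark parse a JSON document that spans several lines.
    val df = spark.read
      .option("multiLine", "true")
      .json("D:/playground/input.json")

    df.printSchema()        // root: a struct containing b (long)
    df.select("a.b").show() // the nested value: 1

    spark.stop()
  }
}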

The problem is with the JSON file. The file D:/playground/input.json looks, as you described, like this:

{
  "a": {
    "b": 1
  }
}

This is not the format Spark expects by default. When processing JSON data, Spark treats each new line as a complete JSON document, so parsing fails here.

Keep the complete JSON on a single line, in compact form, by removing all whitespace and newlines, like this:

{"a":{"b":1}}

If you want multiple JSON records in a single file, keep them one per line, like this (the JSON Lines format):

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...

For more info, see the Spark [JSON data source documentation](http://spark.apache.org/docs/latest/sql-data-sources-json.html).

– Avishek Bhattacharya

This error can mean one of two things:

1. Your file format isn't what you think it is, and you are using the wrong read method for it (e.g., the file is plain text but you mistakenly used the json method).

2. Your file doesn't follow the standard for the format you are using (even though you used the correct method for the correct format); this commonly happens with JSON.
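
To tell which case you are hitting, you can follow the error message's own suggestion: read with an explicit schema that includes the corrupt-record column, cache the result, and inspect the lines that failed to parse. A minimal sketch, assuming the spark session and file path from the question:

import spark.implicits._
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Explicit schema with the corrupt-record column, so malformed input
// lands in _corrupt_record instead of aborting the query.
val schema = new StructType()
  .add("a", new StructType().add("b", LongType))
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).json("D:/playground/input.json").cache()

// Show the raw lines that Spark could not parse as JSON.
df.filter($"_corrupt_record".isNotNull).select("_corrupt_record").show(false)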

– Aramis NSR