
I have a JSON file:

{
  "a": {
    "b": 1
  }
}

I am trying to read it:

val path = "D:/playground/input.json"
val df = spark.read.json(path)
df.show()

But I am getting an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example: spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count() and spark.read.schema(schema).json(file).select("_corrupt_record").show(). Instead, you can cache or save the parsed results and then send the same query. For example, val df = spark.read.schema(schema).json(file).cache() and then df.filter($"_corrupt_record".isNotNull).count().;

So I tried to cache it, as the message suggests:

val path = "D:/playground/input.json"
val df = spark.read.json(path).cache()
df.show()

But I keep getting the same error.

The error clearly says that the problem is that your **JSON** was not read properly. The reason is that **Spark** requires a specific format: _"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object."_ - [**documentation**](http://spark.apache.org/docs/latest/sql-data-sources-json.html) - Also, on the [**Scaladoc**](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) of the read method, you can see the `multiLine` option, which can be useful in this case. – Luis Miguel Mejía Suárez Aug 11 '19 at 16:31

Please check this [link](https://stackoverflow.com/questions/38545850/read-multiline-json-in-apache-spark) for some additional info and a solution. – kode Aug 11 '19 at 17:10

Painful issue, poorly explained imho. – thebluephantom Aug 11 '19 at 21:03

@LuisMiguelMejíaSuárez Thanks, I didn't know that. Now it's working. Please write your comment as an answer and I will accept it. – Alon Aug 11 '19 at 22:24

3 Answers


You may try either of these two ways.

Option-1: Put the JSON on a single line, as described in @Avishek Bhattacharya's answer below.

Option-2: Add the multiLine option when reading the JSON, as follows. You can also read the nested attribute, as shown below.

// The multiLine option (option names are case-insensitive) tells Spark
// that a single JSON document may span several lines.
val df = spark.read.option("multiLine", "true").json("C:\\data\\nested-data.json")
df.select("a.b").show()

Here is the output for Option-2.

20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
|  b|
+---+
|  1|
+---+
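
For reference, here is a minimal self-contained sketch of Option-2 as a standalone application, using the question's file path. The object name matches the NestedJsonReader.scala seen in the log above, but the appName and local master are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object NestedJsonReader {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; adjust master/appName for your environment.
    val spark = SparkSession.builder()
      .appName("NestedJsonReader")
      .master("local[*]")
      .getOrCreate()

    // multiLine lets Spark parse a JSON document that spans several lines.
    val df = spark.read
      .option("multiLine", "true")
      .json("D:/playground/input.json")

    df.printSchema()        // root: a struct containing b (long)
    df.select("a.b").show() // the nested value: 1

    spark.stop()
  }
}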

The problem is with the JSON file. The file D:/playground/input.json looks, as you described, like this:

{
  "a": {
    "b": 1
  }
}

This is not the format Spark expects by default. When processing JSON data, Spark treats each new line as a complete JSON document, so parsing fails here.

Keep the complete JSON on a single line, in compact form, by removing all whitespace and newlines, like this:

{"a":{"b":1}}

If you want multiple JSON records in a single file, keep them one per line, like this (the JSON Lines format):

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...

For more info, see the Spark [JSON data source documentation](http://spark.apache.org/docs/latest/sql-data-sources-json.html).

– Avishek Bhattacharya

This error can mean one of two things:

1. Your file format isn't what you think it is, and you are using the wrong read method for it (e.g., the file is plain text but you mistakenly used the json method).

2. Your file doesn't follow the standard for the format you are using (even though you used the correct method for the correct format); this commonly happens with JSON.
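
To tell which case you are hitting, you can follow the error message's own suggestion: read with an explicit schema that includes the corrupt-record column, cache the result, and inspect the lines that failed to parse. A minimal sketch, assuming the spark session and file path from the question:

import spark.implicits._
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Explicit schema with the corrupt-record column, so malformed input
// lands in _corrupt_record instead of aborting the query.
val schema = new StructType()
  .add("a", new StructType().add("b", LongType))
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).json("D:/playground/input.json").cache()

// Show the raw lines that Spark could not parse as JSON.
df.filter($"_corrupt_record".isNotNull).select("_corrupt_record").show(false)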

– Aramis NSR