I was trying to use a JSON file as a small DB. After registering a temporary table on the DataFrame, I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt") // register the DataFrame so it can be queried via SQL
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema()
The result:
root
 |-- _corrupt_record: string (nullable = true)
My JSON file:
{
  "id": 1,
  "name": "Morty",
  "age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
The content of _corrupt_record is:
+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|            "id": 1,|
|    "name": "Morty",|
|           "age": 21|
|                   }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is the newlines.
UPD3
I found the following sentence in the docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
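So, if I understand it correctly, a valid input for this reader puts every record on its own line, e.g. (the second record here is made up just for illustration):

{"id": 1, "name": "Morty", "age": 21}
{"id": 2, "name": "Rick", "age": 60}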
It isn't convenient to keep JSON in this format. Is there any workaround to avoid the multi-line structure, or to convert the JSON into one-liners?
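In case it's useful, here is the workaround I'm currently trying (untested sketch; it assumes sc is the SparkContext behind sqlCtx): read each file as a whole with wholeTextFiles and pass the resulting strings to the JSON reader, which parses every element as a single document, so embedded newlines shouldn't matter:

// Each file becomes one (path, content) pair; keep only the content.
val jsonRDD = sc.wholeTextFiles("/path/to/user.json").map(_._2)
// read.json on an RDD[String] parses each element as one JSON document.
val df = sqlCtx.read.json(jsonRDD)
df.registerTempTable("user_tt")
sqlCtx.sql("SELECT name FROM user_tt").show()

From what I've read, newer Spark versions also add a multiLine option to the JSON reader (spark.read.option("multiLine", true).json(...)), but it isn't available in the version I'm on.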