I have test JSON data at the following link:

http://developer.trade.gov/api/market-research-library.json

When I try to read the schema directly from it as follows:

public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}

the printed schema contains only a corrupt-record column:

root
 |-- _corrupt_record: string (nullable = true)

If the file is malformed, how can I convert it into the format Spark expects?

zero323
Utkarsh Saraf

3 Answers

Convert your JSON to a single line.

Or set option("multiLine", true) on the reader (available since Spark 2.2) to allow multi-line JSON.
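
The first option can be done without Spark. Here is a minimal sketch in Python, using only the standard library, that parses a pretty-printed document and re-serializes it on one line. The function name to_single_line and the sample document are illustrative assumptions, not part of the original data.

```python
import json


def to_single_line(text: str) -> str:
    """Parse a (possibly pretty-printed) JSON document and re-serialize it on one line."""
    return json.dumps(json.loads(text), separators=(",", ":"))


# Hypothetical sample standing in for the multi-line file.
pretty = """{
  "swagger": "2.0",
  "info": {
    "title": "Market Research Library"
  }
}"""

single = to_single_line(pretty)
print(single)
```

Writing the resulting string to a file should give input that the default line-by-line JSON reader can parse.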

Zhang Tong

If this is the only JSON file you want to convert to a DataFrame, I suggest using the wholeTextFiles API. Since the JSON is not in a format Spark can read line by line, you can convert it only when the whole file is read as a single record, and that is what wholeTextFiles does.

Then you can strip the line feeds and spaces from the JSON string, which gives you the required DataFrame. Be aware that replacing every space also removes the spaces inside string values.

sqlContext.read.json(
  sc.wholeTextFiles("path to market-research-library.json file")
    .map(_._2.replace("\n", "").replace(" ", ""))
)

You should get the required DataFrame with the following schema:

root
 |-- basePath: string (nullable = true)
 |-- definitions: struct (nullable = true)
 |    |-- Report: struct (nullable = true)
 |    |    |-- properties: struct (nullable = true)
 |    |    |    |-- click_url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- country: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- description: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- expiration_date: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- id: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- report_type: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- source_industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- title: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |-- host: string (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- paths: struct (nullable = true)
 |    |-- /market_research_library/search: struct (nullable = true)
 |    |    |-- get: struct (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- parameters: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- format: string (nullable = true)
 |    |    |    |    |    |-- in: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- required: boolean (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- responses: struct (nullable = true)
 |    |    |    |    |-- 200: struct (nullable = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- schema: struct (nullable = true)
 |    |    |    |    |    |    |-- items: struct (nullable = true)
 |    |    |    |    |    |    |    |-- $ref: string (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- tags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |-- produces: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- schemes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- swagger: string (nullable = true)
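
One caveat about the replace(" ", "") step in the snippet above: it removes every space, including those inside string values, so the data changes even though the JSON stays valid. Here is a small Python sketch of the difference, assuming a made-up sample document. naive_strip mirrors the replacement chain, and compact is a hypothetical alternative that parses and re-serializes instead.

```python
import json

# Hypothetical sample; the title contains spaces that should survive.
pretty = '{\n  "info": {\n    "title": "Market Research Library"\n  }\n}'


def naive_strip(text: str) -> str:
    """Mirror of the replace chain: drop all newlines and all spaces."""
    return text.replace("\n", "").replace(" ", "")


def compact(text: str) -> str:
    """Parse and re-serialize, so only insignificant whitespace is dropped."""
    return json.dumps(json.loads(text), separators=(",", ":"))


print(json.loads(naive_strip(pretty))["info"]["title"])  # MarketResearchLibrary
print(json.loads(compact(pretty))["info"]["title"])      # Market Research Library
```

If the string values matter, dropping only the "\n" characters, or re-serializing as in compact, keeps them intact.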
Ramesh Maharjan

By default, Spark expects JSON Lines (JSONL), with one complete JSON value per line, which is not the same as a standard multi-line JSON document. Here is a small Python script to convert your JSON to the expected format:

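import json

import jsonlines

# Load the whole standard JSON document into memory.
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

# write_all expects an iterable of records, so wrap a single top-level object in a list.
records = json_data if isinstance(json_data, list) else [json_data]

# Write one JSON value per line (JSON Lines).
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(records)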

Then you can read the converted file in your program exactly as in your original code.
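
If installing jsonlines is not an option, the same conversion can be sketched with the standard json module. This assumes the input is either a single JSON object (written as one line) or an array of records (one line per element). to_json_lines is an illustrative helper, not a library function.

```python
import json


def to_json_lines(data) -> str:
    """Serialize a parsed JSON value as JSON Lines text."""
    # A top-level array becomes one line per element; anything else is one record.
    records = data if isinstance(data, list) else [data]
    return "".join(json.dumps(record) + "\n" for record in records)


print(to_json_lines([{"id": 1}, {"id": 2}]), end="")
print(to_json_lines({"swagger": "2.0"}), end="")
```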