
I am trying to read JSON data using Apache Spark. Here is the code I have tried so far:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("ExplodeDemo")
  .setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.json("file location")
df.printSchema()

This works well when I pass the file name as an argument to sqlContext.read.json, but my requirement is to pass the JSON string directly instead of a file.

For that I tried this:

val rdd = sc.parallelize(Seq(r))
val df = sqlContext.read.json(rdd)
df.printSchema()

where r is my JSON string. With this code there are no compilation errors, but when I call df.printSchema() it prints the following, and I am not able to retrieve the data.

root
 |-- _corrupt_record: string (nullable = true)
  • What does your JSON data look like? Spark can only read one JSON object per line (or per file if you set `multiline` to `true`). – philantrovert Feb 05 '18 at 09:07
  • It's valid JSON; I tested it with jsonlint.com. – ROOT Feb 05 '18 at 09:10
  • I am not asking how to access sub-entities of nested JSON; I am asking why it prints _corrupt_record instead of the schema. – ROOT Feb 06 '18 at 05:28
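
For reference, here is a minimal, self-contained sketch of the pattern discussed in the comments, using a made-up single-line JSON string (the field names are illustrative, not from the original question). When each RDD element is one complete, well-formed JSON object, read.json infers the schema as expected:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("JsonStringDemo").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Hypothetical JSON string: one complete JSON object per RDD element.
val r = """{"name":"alice","age":30}"""
val rdd = sc.parallelize(Seq(r))
val df = sqlContext.read.json(rdd)
df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)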

1 Answer


Well, you need to provide the schema as well.

A DataFrame is just an RDD with a schema. When you read a file through the DataSource API, Spark infers the schema from the file. Since you are not using the DataSource API to infer the schema automatically here, you need to pass the schema explicitly.

import org.apache.spark.sql.types.{StructType, StructField, LongType, IntegerType}

// Replace the attribute names and types with the ones in your JSON.
val YOURSCHEMA = StructType(Array(
  StructField("Attribute1", LongType, true),
  StructField("Attribute2", IntegerType, true)))

// spark is the SparkSession; on Spark 1.x use sqlContext.read instead.
val df = spark.read.schema(YOURSCHEMA).json(rdd)
df.printSchema
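
For example, with a hypothetical single-line JSON string matching that schema (the values below are made up), the explicit schema is applied instead of being inferred:

// Each RDD element is one complete JSON object.
val rdd = sc.parallelize(Seq("""{"Attribute1": 1, "Attribute2": 2}"""))
val df = spark.read.schema(YOURSCHEMA).json(rdd)
df.show()
// +----------+----------+
// |Attribute1|Attribute2|
// +----------+----------+
// |         1|         2|
// +----------+----------+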
Ishan Kumar