
I have a sequence file containing multiple JSON records. I want to send every JSON record to a function. How can I extract one JSON record at a time?

satyambansal117

2 Answers


Unfortunately there is no standard way to do this.

Unlike YAML, which has a well-defined way for one file to contain multiple YAML "documents", JSON defines no such mechanism.

One way to solve your problem is to invent your own "record separator". For example, you can use newline characters to separate adjacent JSON objects. Tell your JSON encoder never to emit a literal newline inside a record (any newline inside a string value is escaped as the two-character sequence \n anyway). As long as your JSON decoder can rely on the fact that a newline only ever separates two JSON records, it can read the stream one line at a time and decode each line independently.
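This one-record-per-line convention is commonly known as JSON Lines (NDJSON). A minimal sketch in Scala, assuming a file named records.jsonl with one JSON object per line and a hypothetical handleRecord function standing in for whatever you want to call per record:

```scala
import scala.io.Source

// Hypothetical per-record handler; replace with your own function.
def handleRecord(json: String): Unit =
  println(s"got record: $json")

// Read the stream one line at a time; every non-empty line is
// exactly one JSON record, so no whole-file parse is needed.
val source = Source.fromFile("records.jsonl")
try {
  source.getLines().filter(_.trim.nonEmpty).foreach(handleRecord)
} finally {
  source.close()
}
```

Each line can then be parsed with whatever JSON library you already use, since every line is a complete, standalone JSON document.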

It has also been suggested that you can wrap multiple JSON objects in a single JSON array, but then it is no longer a "stream": the decoder must read the whole array before handing you the first record.

wks

You can read the content of your sequence file into an RDD[String] and convert it to a Spark DataFrame:

import org.apache.hadoop.io.{BytesWritable, LongWritable}

// Read the Hadoop sequence file and turn each value into a String.
// copyBytes() is used instead of getBytes(), which can return a backing
// array padded beyond the valid length.
val seqFileContent = sc
  .sequenceFile[LongWritable, BytesWritable](inputFilename)
  .map(x => new String(x._2.copyBytes()))

// Parse each string as one JSON record into a DataFrame.
val dataframeFromJson = sqlContext.read.json(seqFileContent)
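To then send every record to a function, one option is to re-serialize each row back to a JSON string with toJSON and iterate over the result; a sketch continuing the snippet above, where process is a hypothetical placeholder for your function:

```scala
// Hypothetical per-record function; replace with your own logic.
def process(record: String): Unit = println(record)

// toJSON turns each row back into a JSON string; collect() brings the
// records to the driver so they can be handed to the function one at a time.
dataframeFromJson.toJSON.collect().foreach(process)
```

Note that collect() pulls everything onto the driver, so for large files you may prefer to call the function inside foreach on the executors instead.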
Gorini4