7

I need to ingest large JSON files whose records may span multiple lines (not files); it depends entirely on how the data provider is writing them.

Elephant-Bird assumes LZO compression, which I know the data provider will not be using.

The Dzone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record will be on a single line.

Any ideas, short of squishing/minifying the JSON (the file will be huge), on how to properly split the file so that the JSON records don't break?

Edit: lines, not files

Maz
  • If "validation" is what you want to do (i.e. establishing a context that lets you know when a JSON string has finished syntactically, without loading the whole thing into memory), you could look into event-based parsers (similar to what SAX does with XML). This answer http://stackoverflow.com/a/823632/18771 lists few. – Tomalak Aug 13 '12 at 17:01
  • The data provider will be preparing a load of data in JSON format (don't ask me why JSON, I think they have that feed already set up on their end). Some proprietary system will put the file into HDFS to be run through the M/R process. The idea is that when the input file is read, I can reliably split it such that the top-level JSON objects don't get ruined. The issue is I've no control over the file itself, so it could get dumped on me with a top-level object spanning multiple lines. – Maz Aug 13 '12 at 17:18
  • Chukwa should have a JSONInputFormat, but I don't know if it reads multi-line records. – Thomas Jungblut Aug 13 '12 at 17:25
  • So my suggestion would be to pass it through a simple streaming parser (along the lines of [this article, using Jackson](http://www.mkyong.com/java/jackson-streaming-api-to-read-and-write-json/), see point 2). The parser would have to do only one thing: handle nesting depth, watching for the matching end token (a rough sketch follows these comments). This way you'd know when the JSON object is finished (and, FWIW, whether it's well-formed). I can't create such a program for you, though, so this is just a rough idea. – Tomalak Aug 13 '12 at 17:30
  • Chukwa has a JSONLoader, but not an InputFormat extended class for JSON. – Maz Aug 13 '12 at 17:31
  • One more thing you have to worry about is that the record itself might be split across multiple mappers unless you implement a custom split as well. – Fakrudeen Aug 14 '12 at 08:49
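
For reference, the depth-tracking loop Tomalak describes might look roughly like this with the Jackson streaming API. This is only a sketch: the class name is made up, and it uses the current com.fasterxml packages rather than the org.codehaus ones that were common in 2012.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.io.InputStream;

// Sketch: scan one top-level JSON object from a stream without holding it
// all in memory, by counting start/end tokens until the depth returns to 0.
public class JsonBoundaryScanner {

    // Returns true if one complete, well-formed top-level object was read;
    // the parser's current location then marks where the record ended.
    public static boolean readOneObject(InputStream in) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(in)) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                return false; // stream does not start with a top-level object
            }
            int depth = 1;
            while (depth > 0) {
                JsonToken token = parser.nextToken();
                if (token == null) {
                    return false; // input ended before the object closed
                } else if (token == JsonToken.START_OBJECT || token == JsonToken.START_ARRAY) {
                    depth++;
                } else if (token == JsonToken.END_OBJECT || token == JsonToken.END_ARRAY) {
                    depth--;
                }
            }
            return true;
        }
    }
}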

2 Answers

2

Short of any other suggestions, and depending on how the JSON is being formatted, you may have an option.

The problem, as pointed out in the Dzone article, is that JSON has no end element that you can easily locate when you jump to a split point.

Now, if your input JSON has 'pretty' or standard formatting, you can take advantage of this in a custom input format implementation.

For example, take the sample JSON from the Dzone article:

{
  "results" :
    [
      {
        "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
        "from_user" : "grep_alex",
        "text" : "RT @kevinweil: After a lot of hard work by ..."
      },
      {
        "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
        "from_user" : "grep_alex",
        "text" : "@miguno pull request has been merged, thanks again!"
      }
    ]
}

With this format, you know (hope?) that each new record starts on a line with six spaces and an opening brace, and ends the same way: six spaces and a closing brace.

So your logic in this case: consume lines until you find a line with six spaces and an opening brace, then buffer content until you find the six spaces and a closing brace, then use whatever JSON deserializer you want to turn that into a Java object (or just pass the multi-line Text to your mapper).
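
A rough sketch of such a record reader follows. The class name and marker strings are hypothetical; it leans on LineRecordReader for the line-by-line reading and, like the approach itself, ignores the case Fakrudeen raises where a record straddles a split boundary.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical reader: glues lines together between the "      {" and
// "      }" boundary markers and emits each whole record as one value.
public class IndentedJsonRecordReader extends RecordReader<LongWritable, Text> {

    private static final String RECORD_START = "      {";
    private static final String RECORD_END   = "      }";

    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        boolean inRecord = false;

        while (lineReader.nextKeyValue()) {
            String line = lineReader.getCurrentValue().toString();

            if (!inRecord && line.startsWith(RECORD_START)) {
                inRecord = true;                           // found "      {"
                key.set(lineReader.getCurrentKey().get()); // byte offset of record start
            }
            if (inRecord) {
                record.append(line).append('\n');
                if (line.startsWith(RECORD_END)) {         // found "      }" (or "      },")
                    value.set(record.toString());
                    return true;
                }
            }
        }
        return false; // no more complete records in this split
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
}

The key here is the byte offset where the record started, and the value is the whole multi-line record as Text, ready to be handed to whatever deserializer you prefer.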

Chris White
  • The provider's JSON format is pretty simple. While I appreciate the response and will bookmark it for future reference, I ultimately decided to assume it was a simple key-value pair with no depth (in this instance it is). My desire was to "get it right the first time" by having a solution for future JSON providers... but agility is trumping perfection in this instance. – Maz Aug 14 '12 at 13:28
1

The best way for you to split and parse multi-line JSON data would be to extend the NLineInputFormat class and define your own notion of what constitutes an InputSplit (for example, 1,000 JSON records could constitute one split).

Then you would need to extend the LineRecordReader class and define your own notion of what constitutes one line (in this case, one record).

This way, you would get well-defined splits, each containing 'N' JSON records, which can then be read by that record reader, and each of your map tasks would receive one record to process at a time.
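
To make the wiring concrete, here is a simplified sketch of the input-format side. It is hedged: it extends plain FileInputFormat rather than NLineInputFormat, the names are hypothetical, and it dodges the split-boundary problem by making files non-splittable instead of defining N-record splits.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: an input format that hands each mapper whole multi-line JSON
// records. The record reader does the real work (for instance, the
// IndentedJsonRecordReader sketched in the other answer, or a reader
// built on a streaming JSON parser).
public class MultiLineJsonInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new IndentedJsonRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Simplest (least parallel) option: one split per file, so no record
        // can straddle a split boundary. Overriding getSplits() to produce
        // one split per N records, as described above, scales better.
        return false;
    }
}

A job would then use it with job.setInputFormatClass(MultiLineJsonInputFormat.class).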

Charles Menguy's reply to How does Hadoop process records split across block boundaries? explains the nuances of this approach very well.

For a sample extension of NLineInputFormat along these lines, check out http://hadooped.blogspot.com/2013/09/nlineinputformat-in-java-mapreduce-use.html

A similar multi-line CSV format for Hadoop can be found here : https://github.com/mvallebr/CSVInputFormat

Update: I found a relevant multi-line JSON input format for Hadoop here: https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java

Sudarshan Thitte