6

I have a large file (>50 MB) that contains a JSON hash. Something like:

{ 
  "obj1": {
    "key1": "val1",
    "key2": "val2"
  },
  "obj2": {
    "key1": "val1",
    "key2": "val2"
  }
  ...
}

Rather than parsing the entire file and then taking, say, the first ten elements, I'd like to parse the items in the hash one at a time. I don't actually care about the keys (e.g. obj1).

If I convert the above to this:

{
  "key1": "val1",
  "key2": "val2"
}
{
  "key1": "val1",
  "key2": "val2"
}

I can easily achieve what I want using Yajl streaming:

require 'yajl'

count = 10
# The block form of File.open closes the file even if parsing raises.
File.open(path_to_file) do |io|
  Yajl::Parser.parse(io) do |obj|
    puts "Parsed: #{obj}"
    count -= 1
    break if count == 0
  end
end

Is there a way to do this without having to alter the file? Some sort of callback in Yajl maybe?

the Tin Man
rainkinz
  • Yajl is *supposed* to support a SAX-type parser, which would let you selectively process objects as the file is being read. However, like you, I don't see any examples of doing that with the Ruby interface. Streaming isn't going to help you if the entire document has to be read into memory before the JSON is parsed and an object is returned. Your code only results in a partially read Ruby structure for big files. Yajl won't have seen the closing braces and brackets it needs to know exactly where to close objects, so I don't think your idea is going to work. – the Tin Man Jan 07 '14 at 19:40
  • From my tests, the `obj` in your block won't be available until the file is read in full. Perhaps the developers of the Ruby gem can shed more light on it? – the Tin Man Jan 07 '14 at 19:42
  • 1
    @theTinMan thanks. Yeah it seems like there is support for SAX type parsing in Yajl, but not in the ruby wrapper around it. Right that obj in my block wouldn't be available until I read the whole file. Not desirable. I found another solution and pasted an answer below that I'd be interested in your feedback on. – rainkinz Jan 08 '14 at 16:44

2 Answers

12

I ended up solving this using JSON::Stream, which has callbacks for start_document, start_object, etc.

I gave my parser a to_enum method which emits each 'Resource' object as it is parsed. Note that ResourceCollectionNode is never really used unless you parse the JSON stream to completion, and ResourceNode is a subclass of ObjectNode for naming purposes only, though I might just get rid of it:

require 'json/stream'

class Parser
  METHODS = %w[start_document end_document start_object end_object start_array end_array key value]

  attr_reader :result

  def initialize(io, chunk_size = 1024)
    @io = io
    @chunk_size = chunk_size
    @parser = JSON::Stream::Parser.new

    # register callback methods
    METHODS.each do |name|
      @parser.send(name, &method(name))
    end 
  end

  def to_enum
    Enumerator.new do |yielder|
      @yielder = yielder
      begin
        until @io.eof?
          # puts "READING CHUNK"
          chunk = @io.read(@chunk_size)
          @parser << chunk
        end
      ensure
        @yielder = nil
      end
    end
  end

  def start_document
    @stack = []
    @result = nil
  end

  def end_document
    # @result = @stack.pop.obj
  end

  def start_object
    if @stack.size == 0
      @stack.push(ResourceCollectionNode.new)
    elsif @stack.size == 1
      @stack.push(ResourceNode.new)
    else
      @stack.push(ObjectNode.new)
    end
  end

  def end_object
    if @stack.size == 2
      node = @stack.pop
      #puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj

      # puts "Parsed complete resource: #{node.obj}"
      @yielder << node.obj

    elsif @stack.size == 1
      # puts "Parsed all resources"
      @result = @stack.pop.obj
    else
      node = @stack.pop
      # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj
    end
  end

  def end_array
    node = @stack.pop
    @stack[-1] << node.obj
  end

  def start_array
    @stack.push(ArrayNode.new)
  end

  def key(key)
    # puts "Stack depth: #{@stack.size} KEY: #{key}"
    @stack[-1] << key
  end

  def value(value)
    node = @stack[-1]
    node << value
  end

  class ObjectNode
    attr_reader :obj

    def initialize
      @obj, @key = {}, nil
    end

    def <<(node)
      if @key
        @obj[@key] = node
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  class ResourceNode < ObjectNode
  end

  # Node that contains all the resources - a Hash keyed by url
  class ResourceCollectionNode < ObjectNode
    def <<(node)
      if @key
        @obj[@key] = node
        # puts "Completed Resource: #{@key} => #{node}"
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  class ArrayNode
    attr_reader :obj

    def initialize
      @obj = []
    end

    def <<(node)
      @obj << node
      self
    end
  end

end

and an example in use:

def json
  <<-EOJ
  {
    "1": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "2": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "3": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "4": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "5": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "6": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    }
  }
  EOJ
end


require 'stringio'
require 'pp'

io = StringIO.new(json)
resource_parser = Parser.new(io, 100)

count = 0
resource_parser.to_enum.each do |resource|
  count += 1
  puts "READ: #{count}"
  pp resource
  break
end

io.close

Output:

READ: 1
{"url"=>"url_1", "title"=>"title_1", "http_req"=>{"status"=>200, "time"=>10}}
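
The depth check in end_object is the core of the approach, and it can be tried without the gem. The following is a minimal sketch that replays a hand-written list of events through a stack and emits each second-level object as soon as it closes. The event names mirror the callbacks above, but the EVENTS list and the replay helper are hypothetical stand-ins for illustration, not part of JSON::Stream. Unlike the class above, this sketch does not keep finished resources in the root hash.

```ruby
# Hypothetical events for:
#   {"1": {"url": "url_1", "http_req": {"status": 200}}, "2": {"url": "url_2"}}
EVENTS = [
  [:start_object],
  [:key, "1"], [:start_object],
  [:key, "url"], [:value, "url_1"],
  [:key, "http_req"], [:start_object],
  [:key, "status"], [:value, 200],
  [:end_object],
  [:end_object],
  [:key, "2"], [:start_object],
  [:key, "url"], [:value, "url_2"],
  [:end_object],
  [:end_object]
].freeze

# Returns an Enumerator that yields each second-level object once it is complete.
def replay(events)
  Enumerator.new do |yielder|
    stack = [] # one [hash, pending_key] frame per currently open object
    events.each do |type, arg|
      case type
      when :start_object
        stack.push([{}, nil])
      when :key
        stack.last[1] = arg
      when :value
        hash, key = stack.last
        hash[key] = arg
        stack.last[1] = nil
      when :end_object
        hash, _key = stack.pop
        if stack.size == 1
          # Only the root is still open, so a resource just closed.
          yielder << hash
          stack.last[1] = nil
        elsif stack.size > 1
          # A nested object closed: attach it to its parent under the pending key.
          parent, key = stack.last
          parent[key] = hash
          stack.last[1] = nil
        end
      end
    end
  end
end

replay(EVENTS).each do |resource|
  p resource
  break # stop after the first resource; the remaining events are never replayed
end
```

Because the Enumerator only advances when the consumer asks for the next item, breaking out of the loop stops the replay early, which is the same property the to_enum method above relies on.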
the Tin Man
rainkinz
  • 1
    That looks like a better path. It's a shame that Yajl-Ruby doesn't expose the SAX-like interface, because they claim a big speedup. – the Tin Man Jan 08 '14 at 17:07
5

I faced the same problem and created the gem json-streamer, which saves you from having to write your own callbacks.

The usage in your case would be (v 0.4.0):

require 'json/streamer'

io = File.open(path_to_file)
streamer = Json::Streamer::JsonStreamer.new(io)
streamer.get(nesting_level: 1).each do |object|
  p object
end
io.close

Applied to your example, it yields each object without its 'obj' key:

{
  "key1": "val1",
  "key2": "val2"
}
thisismydesign