
I am trying to parse a JSON object that consists of a few hashes and one massive array of hashes (sometimes 300,000 hashes inside the array, ~200 MB). Here is an example of the JSON object. I need to parse the array `report_datasets` hash by hash.

https://api.datacite.org/reports/0cb326d1-e3e7-4cc1-9d86-7c5f3d5ca310

{ "report_header": { "report_id": 33738, "report_name": "Report first" },
  "report_datasets": [
    { "dataset_id": 1, "yop": 1990 },
    { "dataset_id": 2, "yop": 2007 },
    { "dataset_id": 3, "yop": 1983 },
    ...
    { "dataset_id": 578999, "yop": 1964 },
  ]
}

In every approach I have tried, including a few using yajl-ruby and json-streamer, my app is killed. When I use parse_chunk,

def parse_very_large_json
  options = { symbolize_keys: false }
  parser = Yajl::Parser.new(options)
  parser.on_parse_complete = method(:print_each_item)

  report_array = parser.parse_chunk(json_string)
end

def print_each_item(report)
  report["report-datasets"].each do |dataset|
    puts "this is an element of the array"
    puts dataset
  end
end

parsing happens, but eventually again it is killed.

The problem seems to be that, used this way, there is not much difference between `Yajl::Parser.new.parse` and `Yajl::Parser.new.parse_chunk`: both approaches end up killed.

How can one parse the elements of such a massive JSON array efficiently without killing the rails app?

  • any error message from the OS when your app is killed? – Raj Nov 26 '18 at 08:17
  • nada @emaillenin, I have been running my tests with `rspec` and I can see the parser printing each element of the array till suddenly I get a `Killed` stdout. That's it. – kriztean Nov 26 '18 at 08:20
  • can you try the Oj gem? – Raj Nov 26 '18 at 08:23
  • That seems like an option, but I need to give it a read. It seems more like building your own parser: http://www.ohler.com/oj/doc/Oj/ScHandler.html . Have you used it? – kriztean Nov 26 '18 at 08:52
  • Did you try the approach described in the article [`Ruby: reading, parsing and forwarding large JSON files in small chunks (i.e. streaming)`](https://coderwall.com/p/l1omyw/ruby-reading-parsing-and-forwarding-large-json-files-in-small-chunks-i-e-streaming)? It uses the Oj::ScHandler parser, but [`Yajl::FFI::Parser` also supports streaming](https://stackoverflow.com/a/52876204/4950680). – Martin Nov 26 '18 at 11:29
  • I'm going to try the `Oj` approach. I have read about the `Yajl::FFI::Parser` approach, but it seems mostly to rely on getting the JSON object from a request (EventMachine). That is not my case: I already have the JSON object; I just need to parse it. – kriztean Nov 27 '18 at 10:31

0 Answers