16

I receive JSON files with data to be analyzed in R, for which I use the RJSONIO package:

library(RJSONIO)
filename <- "Indata.json"
jFile <- fromJSON(filename)

When the JSON files are larger than about 300 MB (uncompressed), my computer starts to use swap memory and the parsing (fromJSON) goes on for hours. A 200 MB file takes only about one minute to parse.

I use R 2.14 (64-bit) on 64-bit Ubuntu with 16 GB RAM, so I'm surprised that swapping is already needed at about 300 MB of JSON.

What can I do to read big JSON files? Is there something in the memory settings that messes things up? I have restarted R and run only the three lines above. The JSON file contains 2-3 columns with short strings and 10-20 columns with numbers from 0 to 1,000,000; i.e. it is the number of rows that makes the file so large (more than a million rows in the parsed data).


Update: From the comments I learned that rjson is written more in C, so I tried it. A 300 MB file that with RJSONIO (according to Ubuntu System Monitor) reached 100% memory use (from a 6% baseline) and went on to swapping needed only 60% memory with the rjson package, and the parsing finished in a reasonable time (minutes).
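The switch itself is minimal; as a sketch using the same file as above (rjson takes the path via its file argument rather than as the first argument):

library(rjson)
# Same file as in the code above; rjson reads it via the 'file' argument.
jFile <- fromJSON(file = "Indata.json")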

Chris
  • I don't have any experience with JSON files remotely that large, but you might look at the `rjson` package with identically named functions. I've heard that it can be faster than `RJSONIO`. – joran Nov 21 '11 at 18:52
  • Yes, the rjson package has more C code and thus will be faster and likely more memory efficient... although I don't have experience with it either. – JD Long Nov 21 '11 at 19:10
  • Someone is not telling the truth here (where here = package `Description`). `RJSONIO` actually advertises its speed performance: `This is an alternative to rjson package. That version was too slow for converting large R objects to JSON and is not extensible, but a very useful prototype. It is fast for parsing.` Now, _what_ is fast for parsing - `RJSONIO` or `rjson`? Are there any performance tests out there? – aL3xa Nov 21 '11 at 19:52
  • Well, FYI, RJSONIO works better for me, but mine is just a small JSON file (< 1 MB). rjson costs me a few seconds, while RJSONIO finishes in a flash. Thus, the RJSONIO description about speed performance is true for me. Also, rjson's output file is not human readable when I open it in Notepad++. Third, I really don't know why rjson can't read the output it generated back into R. My data structure is a nested list consisting of character vectors (possibly empty). In any case, I think it's pretty easy to switch back and forth to find out the better package for your purpose. – 楊祝昇 Nov 23 '11 at 20:11
  • I realize this question is old, but for those who are searching the internet, please have a look at the benchmarks posted here: http://stackoverflow.com/questions/15308435/rjsonio-vs-rjson-better-tuning – Ricardo Saporta May 13 '13 at 03:09

3 Answers

6

Although your question doesn't specify this detail, you may want to make sure that loading the entire JSON into memory is actually what you want. It looks like RJSONIO is a DOM-based API.

What computation do you need to do? Can you use a streaming parser? An example of a SAX-like streaming parser for JSON is yajl.
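For illustration only, here is a minimal chunk-wise sketch in R. It assumes the data could first be rewritten as one JSON record per line (NDJSON), which is an assumption rather than something the question states; the filename and chunk size are made up as well.

library(rjson)  # rjson::fromJSON parses a JSON string passed as its first argument

con <- file("Indata.ndjson", open = "r")   # hypothetical one-record-per-line file
repeat {
  chunk <- readLines(con, n = 10000)       # read the next 10,000 records
  if (length(chunk) == 0) break
  records <- lapply(chunk, fromJSON)       # parse each record separately
  # filter or aggregate 'records' here so the full data never sits in memory
}
close(con)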

Will Bradley
2

Even though the question is very old, this might be of use for someone with a similar problem.

The function jsonlite::stream_in() lets you set pagesize, the number of lines read at a time, and a custom function applied to each such chunk can be supplied as handler. This allows you to work with very large JSON files without reading everything into memory at the same time.

library(jsonlite)
# 'con' must be a connection to a file with one JSON record per line (NDJSON),
# e.g. con <- file("Indata.ndjson")
stream_in(con, pagesize = 5000, handler = function(x){
    # Do something with each chunk of the data here
})
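For example (a hedged sketch: the file name and the column value are made up, not taken from the question), the handler can reduce each chunk to a running summary so that only small aggregates stay in memory:

library(jsonlite)

totals <- list(rows = 0, value_sum = 0)
stream_in(file("Indata.ndjson"), pagesize = 5000, handler = function(df) {
  # df is a data frame holding at most 'pagesize' records of the current chunk
  totals$rows      <<- totals$rows + nrow(df)
  totals$value_sum <<- totals$value_sum + sum(df$value, na.rm = TRUE)  # 'value' is hypothetical
})
totals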
tobiasegli_te
0

Not on memory use, but on speed: for the quite small iris dataset (only 7088 bytes), the RJSONIO package is an order of magnitude slower than rjson. Don't use method = 'R' unless you really have to! Note the different units in the two sets of results.

library(rjson) # library(RJSONIO)
library(plyr)
library(microbenchmark)
x <- toJSON(iris)
(op <- microbenchmark(CJ=toJSON(iris), RJ=toJSON(iris, method='R'),
  JC=fromJSON(x), JR=fromJSON(x, method='R') ) )

# for rjson on this machine...
Unit: microseconds
  expr        min          lq     median          uq        max
1   CJ    491.470    496.5215    501.467    537.6295    561.437
2   JC    242.079    249.8860    259.562    274.5550    325.885
3   JR 167673.237 170963.4895 171784.270 172132.7540 190310.582
4   RJ    912.666    925.3390    957.250   1014.2075   1153.494

# for RJSONIO on the same machine...
Unit: milliseconds
  expr      min       lq   median       uq      max
1   CJ 7.338376 7.467097 7.563563 7.639456 8.591748
2   JC 1.186369 1.234235 1.247235 1.265922 2.165260
3   JR 1.196690 1.238406 1.259552 1.278455 2.325789
4   RJ 7.353977 7.481313 7.586960 7.947347 9.364393
Sean