
I was trying to do some exploratory analysis on a large (2.7 GB) JSON dataset in R; however, the file doesn't even load in the first place. While looking for solutions, I saw that I could process the data in smaller chunks, e.g. by iterating through the larger file or by down-sampling it, but I'm not sure how to do that with a JSON dataset. I also considered converting the original JSON data to .csv, but after having a look around that option didn't seem helpful.

Any ideas here?

Community
VMacuchS

1 Answer


The jsonlite R package supports streaming your data, so there is no need to read all of the JSON into memory at once. See the jsonlite documentation for more details, in particular the `stream_in` function.
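A minimal sketch of that approach (the file name `big_data.json` is a placeholder, and this assumes the file is in NDJSON format, i.e. one JSON record per line, which is what `stream_in` expects):

```r
library(jsonlite)

# Process the file in pages of 10,000 records. The handler is called
# once per page with a data frame, so the full file never sits in memory.
results <- list()
stream_in(
  file("big_data.json"),
  handler = function(df) {
    # keep only a small random sample of each page for exploratory work
    keep <- df[sample(nrow(df), min(100, nrow(df))), ]
    results[[length(results) + 1]] <<- keep
  },
  pagesize = 10000
)

# combine the per-page samples into one down-sampled data frame
sampled <- do.call(rbind, results)
```

If the file is a single large JSON array rather than NDJSON, it would need to be converted first (e.g. with a command-line tool such as `jq`) before `stream_in` can process it line by line.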


Alternatively:

I would dump the JSON into a MongoDB database and process the data from there. You need to install MongoDB and start `mongod` running. After that you can use `mongoimport` to import the JSON file into the database.
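For example (the database and collection names below are placeholders):

```shell
# import the file into a running local MongoDB instance;
# add --jsonArray if the file is one big JSON array rather than
# one JSON document per line
mongoimport --db mydb --collection mydata --file big_data.json
```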

After that, you can use the mongolite package to read data from the database.
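A sketch of reading it back with mongolite (using the same placeholder `mydb`/`mydata` names; this assumes MongoDB is running locally on the default port):

```r
library(mongolite)

# connect to the collection created by mongoimport
m <- mongo(collection = "mydata", db = "mydb", url = "mongodb://localhost")

# pull a manageable subset instead of the whole collection;
# the query can also filter server-side, e.g. '{"year": 2017}'
subset <- m$find(query = '{}', limit = 10000)
```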

Paul Hiemstra
  • I'll certainly try that out, thanks. – VMacuchS Apr 22 '17 at 19:24
  • A full database installation might be overkill depending on the need for persistence or using the built-in query functions that it provides. – OneCricketeer Apr 22 '17 at 19:28
  • @cricket_007 You are right, although getting mongo running is not that hard. I added an alternative that uses streaming processing. That is probably even easier than using a mongodb as the stream already takes care of all the reading in chunks automatically. – Paul Hiemstra Apr 22 '17 at 19:34
  • Running mongo in docker wouldn't be hard, no :) I'm just saying seems like a hammer – OneCricketeer Apr 22 '17 at 19:35
  • I agree, `jsonlite` allows you to stream json, which is a much more lightweight solution. – Paul Hiemstra Apr 22 '17 at 19:36
  • I'll have a look at jsonlite (I was using rjson before). I'll let you know if it works. Should solve my problem, though. – VMacuchS Apr 22 '17 at 19:37
  • @Paul Hiemstra Update: I've tried using the `stream_in` function. It does work for a while but R eventually crashes after some time. Would this be due to processing issues or problems with the actual data? – VMacuchS Apr 26 '17 at 12:45
  • I recommend you ask a new question with a reproducible example of this issue. Then people can help you out. – Paul Hiemstra Apr 28 '17 at 07:00