
I am trying to read a 25 GB nested JSON file in R. I am using the `stream_in()` function from the jsonlite library as follows:

library(jsonlite)
stream_in(file("/data/user-data/ma8994/25GB_nestedJSON_file.json"))

RStudio on my local machine crashed after an hour of reading.

Is there any way to read this big file in R?

  • Can you split your JSONs into smaller chunks (see the chunked-reading sketch after these comments)? If that is not an option, you can try `rjson::fromJSON()`. – amonk Oct 19 '17 at 09:50
  • Possible duplicate: https://stackoverflow.com/questions/8216743/how-to-read-big-json – Sergey Shubin Oct 19 '17 at 09:55
  • Not sure how your libraries work, but it might be due to memory limitations. – Mark Baijens Oct 19 '17 at 10:00
  • How much memory is on your system? This seems to be an ndjson file given your use of `stream_in()`. I'd *highly* suggest using Apache Drill and the `sergeant` package for this type of work even if you think you've got plenty of memory in your R session. I'd also counsel using Drill to convert the ndjson to parquet and still use Drill as a dplyr back-end. – hrbrmstr Oct 19 '17 at 11:04
  • Once I accomplish that and get a huge data frame, how should I program in R with such huge data? What do the experts suggest in such a scenario? Is going for parallel programming packages in R a good option, or should another platform be adopted? – user3516188 Oct 19 '17 at 11:16
  • I plan to extract some features from this dataset initially and later apply some sentiment analysis. @TOBIASEGLI_TE – user3516188 Oct 19 '17 at 11:18
  • @hrbrmstr I haven't got much memory, just 16 GB. I have tried it on RStudio Server as well, and that didn't do well enough either. Thanks for the awesome suggestions; I had never heard of Apache Drill and sergeant before. Looking into some tutorials now, if you can suggest any too :) – user3516188 Oct 19 '17 at 11:24
  • Go to my blog and poke around, or look at the ones in the sergeant package. You can't read the data into R without at least 3x the memory (to accommodate processing it afterwards). Drill may help (it won't read it all into memory), but you will likely, at some point, need a bigger system depending on what you're going to do with the data. I'd definitely convert it to parquet with Drill and use the parquet as the data source in Drill; it'll be much faster (see the sketch after these comments). – hrbrmstr Oct 19 '17 at 11:33
  • @user3516188 Once you have a huge data frame in your workspace, you could try converting it to `data.table`, which is much more efficient at handling big data (see the sketch after these comments). – Paul Lemmens Oct 19 '17 at 11:54
  • @hrbrmstr I will now be doing it on a server with 50 GB, which will hopefully be enough. Thanks for the referrals. – user3516188 Oct 19 '17 at 12:17
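
A minimal sketch of the chunking idea from the first comment, staying with jsonlite: `stream_in()` already takes a `handler` function and a `pagesize`, so each page can be flattened and reduced before the next one is read, and only the reduced pieces are kept in memory. The page size and the column names below are placeholders, not values from the question:

    library(jsonlite)

    chunks <- list()

    # The handler is called once per page of records, so only one page
    # (here 10,000 lines) exists as a data frame at any time.
    stream_in(
      file("/data/user-data/ma8994/25GB_nestedJSON_file.json"),
      handler = function(df) {
        df <- flatten(df)                        # un-nest one level of the JSON
        keep <- df[, c("id", "text")]            # placeholder column names
        chunks[[length(chunks) + 1]] <<- keep    # keep only what is needed later
      },
      pagesize = 10000
    )

    reduced <- do.call(rbind, chunks)   # far smaller than the raw 25 GB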
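
A rough sketch of the Drill route suggested above, assuming a local Apache Drill instance is already running and its default `dfs` storage plugin can see the file; the `lang` column and the Parquet table name are placeholders:

    library(sergeant)
    library(dplyr)

    # Use Drill as a dplyr back-end: Drill scans the ndjson itself and only
    # the (small) query result is pulled into the R session.
    db  <- src_drill("localhost")
    big <- tbl(db, "dfs.`/data/user-data/ma8994/25GB_nestedJSON_file.json`")

    big %>%
      group_by(lang) %>%          # placeholder column name
      summarise(n = n()) %>%
      collect()

    # One-off conversion to Parquet, executed entirely inside Drill
    dc <- drill_connection("localhost")
    drill_query(dc, "
      CREATE TABLE dfs.tmp.`big_as_parquet` AS
      SELECT * FROM dfs.`/data/user-data/ma8994/25GB_nestedJSON_file.json`
    ")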

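And a small sketch of the `data.table` suggestion: `setDT()` converts a data frame in place (by reference), so no second multi-gigabyte copy is made. The toy data frame and the `lang` column stand in for whatever the streaming or Drill step produced:

    library(data.table)

    # Toy stand-in for the big data frame produced by one of the steps above
    big_df <- data.frame(lang = c("en", "en", "fr"), text = c("a", "b", "c"))

    # Convert in place; the object becomes a data.table without being copied
    setDT(big_df)

    # Fast grouped operations with data.table syntax
    big_df[, .N, by = lang]
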
0 Answers