I am using the following code to read an extremely large JSON file (~90 GB):
library(jsonlite)
library(dplyr)
con_in  <- file("events.json")
con_out <- file("event-frequencies1.json", open = "wb")
stream_in(con_in, handler = function(df) {
  # keep only the records past the cursor threshold
  df <- df[df$`rill.message/cursor` > 23000000, ]
  stream_out(df, con_out)
})
close(con_out)
The code works as far as I can see, but the data I need is in the middle of the file, and streaming from the beginning takes hours to reach it. Is there any way to start reading/processing the file from a certain offset (let's say the middle of the file)? I am thinking of a starting line number or a byte offset.
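Something like the sketch below is what I have in mind, assuming the file is NDJSON (one JSON object per line) so that I can throw away the partial line at the seek point and treat everything after it as complete records. The 45e9 offset and the 10000-line read are placeholders, and I know seek() on connections is documented as unreliable on Windows:

seek_con <- file("events.json", open = "rb")
seek(seek_con, where = 45e9)               # jump to roughly the middle of the 90 GB file
invisible(readLines(seek_con, n = 1))      # discard the (almost certainly partial) first line
lines <- readLines(seek_con, n = 10000)    # subsequent lines should be complete JSON records
df <- stream_in(textConnection(lines))     # parse just this slice into a data frame
close(seek_con)

Would something along these lines be safe, or is there a better-supported way to do it?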
If this is not possible with stream_in(), what would be the best way to process such a big file? I need to select certain lines from this JSON and put them into a data frame, or alternatively create a smaller JSON file containing only the selected lines.
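For the smaller-file route, the best I have come up with is the sketch below: copy a byte range of the original into a new file, then run my normal stream_in() pipeline on that. The 30e9/60e9 offsets are placeholders for the region I care about, and again this assumes one JSON object per line:

in_con  <- file("events.json", open = "rb")
out_con <- file("events-middle.json", open = "w")
seek(in_con, where = 30e9)                 # assumed start of the region of interest
invisible(readLines(in_con, n = 1))        # drop the partial line at the seek point
repeat {
  chunk <- readLines(in_con, n = 100000)   # copy in chunks to keep memory bounded
  if (length(chunk) == 0) break            # end of file
  writeLines(chunk, out_con)
  if (seek(in_con) > 60e9) break           # assumed end of the region of interest
}
close(in_con)
close(out_con)

Then events-middle.json is small enough to filter with the original code. Is there a more idiomatic way to do this?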