How to start stream_in() not from the beginning of the file

Question

I am using the following code to read an extremely large JSON file (90 GB):

library(jsonlite)
library(dplyr)
con_in <- file("events.json")
con_out <- file("event-frequencies1.json", open = "wb")
stream_in(con_in, handler = function(df) {
    df <- df[df$`rill.message/cursor` > 23000000, ]
    stream_out(df, con_out)
})
close(con_out)

My code works as far as I can see, but I need data from the middle of the file, and reaching the middle takes hours with the code above. Is there any way to start reading/processing the file from a certain offset (let's say the middle of the file), for example a starting line number or a byte offset?

If that is not possible with stream_in(), what would be the best way to process such a big file? I need to select certain lines from this JSON and either put them into a data frame or write them to a smaller JSON file.

You can't really jump/seek to the middle of a JSON file. You would have no idea where you are in the object structure unless you parse everything before it. If you want to avoid the parsing overhead and treat the file like a string, you might be able to `seek()` ahead within the file and search for certain strings, but that would basically be ignoring the JSON structure. — MrFlick, Apr 28 '16 at 19:27
Ah, I didn't realize that `stream_in` requires the ndjson format. In that case you should be able to skip ahead to the next newline. It would help to have a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to test. — MrFlick, Apr 28 '16 at 19:30

score 0 · Accepted Answer · answered Apr 28 '16 at 19:42

You should be able to seek() on the file connection to start reading at whatever byte offset you like. For example:

con_in <- file("myfile.json")
open(con_in, "rb")  # binary mode, so seek() offsets are plain byte positions
# skip ahead 300 bytes
seek(con_in, 300)
# read to the end of the current line so stream_in() starts on a fresh line
throwaway <- readLines(con_in, 1)

stream_in(con_in, handler = function(df) {
    print(df)
})

close(con_in)
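
To start roughly in the middle of the file, as the question asks, the same idea can be combined with file.size(). The following is only a minimal sketch. It builds a small temporary NDJSON file as a stand-in for the real one, and the column names, the halfway offset, and the pagesize are illustrative assumptions, not values from the question. It relies on the file holding one JSON record per line, which is what stream_in() expects.

```r
library(jsonlite)

# Build a small NDJSON file so the sketch is self-contained
# (hypothetical stand-in for the real 90 GB file).
path <- tempfile(fileext = ".json")
demo <- data.frame(cursor = 1:1000, value = sprintf("event-%04d", 1:1000))
con_demo <- file(path, open = "wb")
stream_out(demo, con_demo, verbose = FALSE)
close(con_demo)

# Open in binary mode so seek() offsets are plain byte positions.
con_in <- file(path, open = "rb")
# file.size() returns a double, so offsets beyond 2^31 bytes are representable.
invisible(seek(con_in, where = floor(file.size(path) / 2)))
# Discard the (probably partial) line that the offset landed in.
throwaway <- readLines(con_in, n = 1)

# Collect every page the handler receives from the offset onwards.
pages <- list()
stream_in(con_in, handler = function(df) {
  pages[[length(pages) + 1]] <<- df
}, pagesize = 100, verbose = FALSE)
close(con_in)

second_half <- do.call(rbind, pages)
```

Here second_half holds only the records that begin after the byte offset. For the real file, the handler would filter each page and pass it to stream_out() as in the question, rather than accumulating everything in memory.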

MrFlick


Thank you for your quick help! `seek()` was the function I was looking for. – Aniko Nagy Apr 28 '16 at 20:34
