I have a 33GB NDJSON file that I need to read into a data.table in R. Gzipped, it is a 2GB file, and ideally I would like to keep it compressed.
The structure isn't so important except that (when imported via `jsonlite::stream_in`) the data I need are in only a few simple columns. The vast majority of the weight of the data is held in `list`s within three columns that I want to discard as soon as possible.
My two challenges are: how can I parallelize the read-in, and how can I limit memory usage? (Right now my worker on this file is using 175GB of memory.)
What I'm doing now:
library(jsonlite); library(data.table)
dt.x <- data.table(flatten(stream_in(gzfile("source.gz"))[, -c(5:7)]))
Ideas:
Maybe there is some way to ignore a portion of the NDJSON during `stream_in`?
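Something like the following is what I have in mind: an untested sketch using `stream_in`'s `handler` and `pagesize` arguments, assuming the three heavy list-columns stay at positions 5:7 as in my current call.

```r
library(jsonlite)
library(data.table)

pages <- list()
stream_in(gzfile("source.gz"),
          handler = function(df) {
            # drop the heavy list-columns page by page; only slim pages are kept
            pages[[length(pages) + 1]] <<- as.data.table(flatten(df[, -c(5:7)]))
          },
          pagesize = 10000)
dt.x <- rbindlist(pages, fill = TRUE)
```

If I understand the docs correctly, only the slimmed-down pages should accumulate in memory this way, but it doesn't help with parallelizing the read.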
Could I parse the `gzfile` connection, e.g. with regex, before it goes to `stream_in`, to remove the excess data?
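As a very rough illustration of that idea (the field name `bigfield` is made up, and I realize regex surgery on JSON is fragile), I was picturing something like:

```r
library(jsonlite)
library(data.table)

con <- gzfile("source.gz", open = "r")
pages <- list()
repeat {
  lines <- readLines(con, n = 100000)
  if (length(lines) == 0) break
  # strip a hypothetical heavy field before parsing; assumes a flat array value
  # and that "bigfield" is never the last key in the object
  slim <- gsub('"bigfield":\\[[^]]*\\],', "", lines)
  tc <- textConnection(slim)
  pages[[length(pages) + 1]] <- as.data.table(stream_in(tc, verbose = FALSE))
  close(tc)
}
close(con)
dt.x <- rbindlist(pages, fill = TRUE)
```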
Can I do something like `readLines` on the `gzfile` connection to read the data 1 million lines per worker?
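A sketch of what I'm picturing, again assuming the heavy columns are 5:7 and using a PSOCK cluster so it stays portable. The gz stream would still be read sequentially in the main session; only the parsing is spread across workers, and shipping the raw lines to the workers has its own copy cost.

```r
library(jsonlite)
library(data.table)
library(parallel)

cl <- makeCluster(4)   # PSOCK cluster, so this should also work on Windows
invisible(clusterEvalQ(cl, { library(jsonlite); library(data.table) }))

con <- gzfile("source.gz", open = "r")
results <- list()
repeat {
  lines <- readLines(con, n = 1e6)   # 1 million decompressed lines at a time
  if (length(lines) == 0) break
  # split the chunk across workers; each worker parses its share and
  # immediately drops the heavy columns
  parts <- split(lines, cut(seq_along(lines), length(cl), labels = FALSE))
  results <- c(results, parLapply(cl, parts, function(part) {
    tc <- textConnection(part)
    df <- stream_in(tc, verbose = FALSE)
    close(tc)
    as.data.table(flatten(df[, -c(5:7)]))
  }))
}
close(con)
stopCluster(cl)
dt.x <- rbindlist(results, fill = TRUE)
```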
EDIT: If at all possible, my goal is to make this portable to other users and keep it entirely within R.