I have a 33GB NDJSON file that I need to read into a data.table in R. Gzipped, it is a 2GB file, and ideally I would like to keep it compressed.
The structure isn't so important except that (when imported via `jsonlite::stream_in`) the data I need are in only a few simple columns. The vast majority of the weight of the data is held in `list`s within three columns that I want to discard as soon as possible.
My two challenges are: how can I parallelize the read-in, and how can I limit memory usage? (Right now my worker on this file is using 175GB of memory.)
What I'm doing now:
library(jsonlite); library(data.table)
dt.x <- data.table(flatten(stream_in(gzfile("source.gz"))[, -c(5:7)]))
Ideas:
Maybe there is some way to ignore a portion of the NDJSON during `stream_in`?
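Something like the following is what I have in mind: an untested sketch using `stream_in`'s `handler` and `pagesize` arguments, assuming the three heavy list-columns stay at positions 5:7 as in my current call.

```r
library(jsonlite)
library(data.table)

pages <- list()
stream_in(gzfile("source.gz"),
          handler = function(df) {
            # drop the heavy list-columns page by page; only slim pages are kept
            pages[[length(pages) + 1]] <<- as.data.table(flatten(df[, -c(5:7)]))
          },
          pagesize = 10000)
dt.x <- rbindlist(pages, fill = TRUE)
```

If I understand the docs correctly, only the slimmed-down pages should accumulate in memory this way, but it doesn't help with parallelizing the read.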
Could I parse the `gzfile` connection, e.g. with regex, before it goes to `stream_in`, to remove the excess data?
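As a very rough illustration of that idea (the field name `bigfield` is made up, and I realize regex surgery on JSON is fragile), I was picturing something like:

```r
library(jsonlite)
library(data.table)

con <- gzfile("source.gz", open = "r")
pages <- list()
repeat {
  lines <- readLines(con, n = 100000)
  if (length(lines) == 0) break
  # strip a hypothetical heavy field before parsing; assumes a flat array value
  # and that "bigfield" is never the last key in the object
  slim <- gsub('"bigfield":\\[[^]]*\\],', "", lines)
  tc <- textConnection(slim)
  pages[[length(pages) + 1]] <- as.data.table(stream_in(tc, verbose = FALSE))
  close(tc)
}
close(con)
dt.x <- rbindlist(pages, fill = TRUE)
```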
Can I do something like `readLines` on the `gzfile` connection to read the data 1 million lines per worker?
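A sketch of what I'm picturing, again assuming the heavy columns are 5:7 and using a PSOCK cluster so it stays portable. The gz stream would still be read sequentially in the main session; only the parsing is spread across workers, and shipping the raw lines to the workers has its own copy cost.

```r
library(jsonlite)
library(data.table)
library(parallel)

cl <- makeCluster(4)   # PSOCK cluster, so this should also work on Windows
invisible(clusterEvalQ(cl, { library(jsonlite); library(data.table) }))

con <- gzfile("source.gz", open = "r")
results <- list()
repeat {
  lines <- readLines(con, n = 1e6)   # 1 million decompressed lines at a time
  if (length(lines) == 0) break
  # split the chunk across workers; each worker parses its share and
  # immediately drops the heavy columns
  parts <- split(lines, cut(seq_along(lines), length(cl), labels = FALSE))
  results <- c(results, parLapply(cl, parts, function(part) {
    tc <- textConnection(part)
    df <- stream_in(tc, verbose = FALSE)
    close(tc)
    as.data.table(flatten(df[, -c(5:7)]))
  }))
}
close(con)
stopCluster(cl)
dt.x <- rbindlist(results, fill = TRUE)
```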
EDIT: If at all possible, my goal is to make this portable to other users and keep it entirely within R.