
I have a 100+ GB JSON file, and when I try to read it with jq my computer keeps running out of RAM. Is there a way to read the file while limiting the memory usage, or some other way to read a VERY huge JSON file?

The command I ran: jq 'keys' fileName.json

KTK
  • Try the `--stream` option. It will handle big inputs by breaking them down into smaller, manageable parts. However, you would need to rewrite your filters as processing them is a little bit different. See the [Streaming](https://stedolan.github.io/jq/manual/#Streaming) section in the manual, especially `truncate_stream` and `fromstream`. – pmf Oct 15 '22 at 02:05
  • What happens if I have no idea what's in the file to see how it's structured? – KTK Oct 15 '22 at 02:14
  • `jq` may not be the right tool for this job. It looks like there exist parsers in various languages based on [`yajl`](https://lloyd.github.io/yajl/), which is an event based parser; that may provide an alternative that can handle very large JSON inputs. – larsks Oct 15 '22 at 02:20
  • `jq` is a perfect fit for this job. For instance, have a look at @peak's `schema.jq` https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed Use it as `jq --arg nullable true 'include "schema"; schema' yourfile.json` – pmf Oct 15 '22 at 02:44
  • When I tried `jq --arg nullable true 'include "schema"; schema' yourfile.json` it gave me this error: `jq: error: syntax error, unexpected IDENT, expecting FORMAT or QQSTRING_START (Windows cmd shell quoting issues?) at , line 1: include schema; schema jq: 1 compile error` – KTK Oct 15 '22 at 05:02
  • @pmf - If the file is too big, running `schema(inputs)` might work. Please see the "example" section of my attempted answer below. – peak Oct 15 '22 at 09:02

3 Answers


jq's streaming parser (invoked using the --stream option) can generally handle very, very large files (and even arbitrarily large files provided certain conditions are met), but it is typically very slow and often quite cumbersome.
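
For instance, to answer the original question (the list of top-level keys) without building the whole object in memory, a minimal streaming sketch, assuming the top level is a JSON object and that a command-line uniq is available, would be:

jq -rn --stream 'inputs | .[0][0]' fileName.json | uniq

Every streamed event carries its path, so .[0][0] is the top-level key it belongs to; because all the events under one key are contiguous, uniq can collapse the duplicates without a sort. Note that, unlike keys, this emits the key names in document order rather than sorted.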

In practice, I find that tools such as jstream and/or my own jm work very nicely in conjunction with jq when dealing with ginormous files. When used this way, they are both very easy to use, though installation is potentially a bit of a hassle.

Unfortunately, if you know nothing at all about the contents of a JSON file except that jq empty takes too long or fails, then there is no CLI-tool that I know of that can produce a useful schema automagically. However, looking at the first few bytes of the file will usually provide enough information to get going. Or you could start with jm count to give a count of the top-level objects, and go from there. jm -s | jq 'keys[]' will give you the list of top-level keys if the top-level is a JSON object.


Here's an example. Suppose we have determined that the large size of the file ginormous.json is primarily because it consists of a very long top-level array. Then assuming that schema.jq (already mentioned elsewhere on this page) is in the pwd, you have some hope of finding an informative schema by running:

jm ginormous.json |
  jq -n 'include "schema" {"search": "."}; schema(inputs)'

See also jq to recursively profile JSON object for a simpler schema-inference engine.

peak

I posted a related question here: Difference between slurp, null input, and inputs filter

If your file is huge, but the documents inside the file aren't that big (just many many smaller ones), jq -n 'inputs' could get you started:

jq -n 'inputs | keys'

Here's an example (with a small file):

$ jq -n 'inputs | keys' <<JSON
{"foo": 21, "bar": "less interesting data"}
{"foo": 42, "bar": "more interesting data"}
JSON
[
  "bar",
  "foo"
]
[
  "bar",
  "foo"
]

This approach will not work if you have a single top-level object that is gigabytes big or has millions of keys.

knittl

One generic way to determine the structure of a very large file containing a single JSON entity would be to run the following query:

jq -nc --stream -f structural-paths.jq huge.json | sort -u

where structural-paths.jq contains:

inputs                                           # the --stream events: [path, value] or [path]
| select(length == 2)                            # keep only the [path, leaf-value] events
| .[0]                                           # take just the path
| map( if type == "number" then 0 else . end )   # collapse array indices to 0

Note that the '0's in the output signify that there is at least one valid array index at the corresponding position, not that '0' is actually a valid index at that position.

Note also that for very large files, using jq --stream to process the entire file could be quite slow.

Example:

Given {"a": {"b": [0,1, {"c":2}]}}, the result of the above incantation would be:

["a","b",0,"c"]
["a","b",0]

Top-level structure

If you just want more information about the top-level structure, you could simplify the above jq program to:

inputs | select(length==1)[0][0] | if type == "number" then 0 else . end
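
For example, an invocation along the lines of the first one (a sketch; the quoting assumes a POSIX-style shell) would be:

jq -nc --stream 'inputs | select(length==1)[0][0] | if type == "number" then 0 else . end' huge.json | sort -u

Against the small example above, {"a": {"b": [0,1, {"c":2}]}}, this prints just "a".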

Structure to a given depth

If the command-line sort fails, then you might want to limit the number of paths by considering them only to a certain depth.

If the depth is not too great, then hopefully your command-line sort will be able to manage; if not, then using the command-line uniq would at least trim the output somewhat.

A better option might be to define unique(stream) in jq, and then use it, as illustrated here:

# Output: a stream of the distinct `tostring` values of the items in the stream
def uniques(stream):
  foreach (stream|tostring) as $s ({};
     if .[$s] then .emit = false else .emit = true | .item = $s | .[$s]=true end;
     if .emit then .item else empty end );

def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniques(spaths($depth))

A suitable invocation of jq would then look like:

jq -nr --argjson depth 3 --stream -f structural-paths.jq huge.json

Besides avoiding the cost of sorting, using uniques/1 preserves the ordering of paths in the original JSON.
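
For instance, with the small example above and --argjson depth 3, the output should be something like:

["a","b",0]
["a","b"]
["a"]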

"JSON Pointer" pointers

If you want to convert array path expressions to "JSON Pointer" strings (e.g. for use with jm or jstream), simply append the following to the relevant jq program:

| "/" + join("/")
peak
  • 105,803
  • 17
  • 152
  • 177
  • How do I create the `structural_paths.jq` file? – KTK Oct 15 '22 at 19:43
  • You could use your favorite text editor, for example. – peak Oct 15 '22 at 20:31
  • It has an error `sort : Array dimensions exceeded supported range. At line:1 char:56 + jq -nc --stream -f structural_paths.jq hugeFile.json | sort -u + ~~~~~~~ + CategoryInfo : NotSpecified: (:) [Sort-Object], OutOfMemoryException + FullyQualifiedErrorId : System.OutOfMemoryException,Microsoft.PowerShell.Commands.SortObjectCommand` – KTK Oct 15 '22 at 21:29
  • @KTK - See the new section: "Structure to a given depth". I'm not familiar with the limitations of PowerShell's sort, but if your machine has a decent amount of memory, you might wish to consider WSL. – peak Oct 16 '22 at 02:25