
I have to merge ~1000 large JSON files (1 MB to 500 MB each) into a single file (~80 GB) on Ubuntu 18. Following this SO question, I use jq with

 jq -s 'reduce .[] as $item ({}; . * $item)' ~/ml/train-*.json > train.json

which works quite nicely for smaller files.

The machine where the merging happens is a 32-core server with 128 GB RAM. Alas, the task ends with a

Killed

message, but in terms of memory, the resources should be sufficient. Can somebody please give me some advice on how to manage this task? Thanks

  • You can invoke the streaming parser of `jq` to avoid in-memory storage, but I don't think `jq` has multiprocessing capabilities to make use of multiple cores. One way would be to use the command from jq - Cookbook - https://github.com/stedolan/jq/wiki/Cookbook#processing-huge-json-texts – Inian Oct 01 '20 at 12:09
  • Try `jq -cn --stream ' def atomize(s): fromstream(foreach s as $in ( {previous:null, emit: null}; if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null then {emit: [[.previous]], previous: $in|.[0][0]} else { previous: ($in|.[0][0]), emit: null} end; (.emit // empty), $in) ) ; atomize(inputs)' ~/ml/train-*.json` and if the resulting file is too large to be stored on disk, pipe the output to `gzip` to compress it, i.e. `jq .. | gzip > result.gz`, and then later use `zcat` to parse the compressed file – Inian Oct 01 '20 at 12:12

1 Answer


Using `-s` here is both unnecessary and asking for trouble. You could try using `inputs` with the `-n` option instead. Also, I suspect you should be using `+` rather than `*`. That should also require fewer computer resources:

jq -n 'reduce inputs as $item (null; . + $item)'
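
For the files from the question, the full invocation might look like the following (the file glob and output redirection are simply carried over from the question, so treat this as a sketch):

 jq -n 'reduce inputs as $item (null; . + $item)' ~/ml/train-*.json > train.json

The choice of operator matters: `*` merges objects recursively (a deep merge), whereas `+` performs a shallow merge in which keys from the right-hand object win, which is considerably cheaper on deeply nested inputs.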

The real question, though, is whether you really need to produce a single JSON entity, rather than a stream of JSON entities.
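
If a stream is acceptable, no merging is needed at all; the inputs can simply be re-emitted, one JSON value per line (a sketch; the output name train.jsonl is only illustrative):

 jq -c . ~/ml/train-*.json > train.jsonl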

Relevance of jq's streaming parser

jq's streaming parser (the one invoked with the --stream command-line option) is primarily targeted to reading very large JSON inputs, and is thus almost certainly irrelevant with respect to the problem being described. On the other hand, it might be relevant when it comes to reading the generated file.
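
For example, here is a sketch of how the generated ~80GB file could later be read incrementally, assuming it is a single large top-level object as produced above; the streaming parser recovers the top-level values one at a time instead of loading the whole document:

 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' train.json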

For the record, the other primary use of the streaming parser is handling inputs which include JSON objects with duplicated keys. The regular jq parser uses a "right-most" semantics for such objects.
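
To illustrate with a small example (not from the original post): the regular parser keeps only the right-most value of a duplicated key, while the streaming parser emits an event for each occurrence:

 # regular parser: right-most value wins
 echo '{"a":1,"a":2}' | jq '.a'
 # 2
 # streaming parser: one event per occurrence, plus a closing event
 echo '{"a":1,"a":2}' | jq -c --stream '.'
 # [["a"],1]
 # [["a"],2]
 # [["a"]]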
