
I'm working with Wikidata (a cross-referencing of multiple data sources, including Wikipedia), which is distributed as a ~50 GB JSON file with no white space. I want to extract certain kinds of data from it, which I could do with grep if it were pretty-printed. I'm running on a Mac.

Some methods of reformatting, e.g.,

 cat ... | python -m json.tool
 ./jq . filename.json

will not work on a large file: Python chokes, and jq dies. There was a great thread here: How can I pretty-print JSON in a (unix) shell script? But I'm not sure how, or whether, any of those approaches can deal with files this large.
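
One caveat on jq: newer releases (1.5+) add a `--stream` mode that parses input incrementally instead of buffering the whole document, so it might survive a file this size. A minimal sketch, assuming a new-enough jq and that the dump is a single top-level JSON array:

 # Requires jq 1.5+. --stream emits [path,value] events instead of
 # building the whole document in memory; fromstream/truncate_stream
 # then reassemble each top-level array element, and -c prints each
 # one as a compact JSON object on its own line, which is grep-friendly.
 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' filename.json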

This company uses "Akka streams" to do this very task (they claim less than 10 minutes to process all of Wikidata), but I know nothing about it: http://engineering.intenthq.com/2015/06/wikidata-akka-streams/

Wikidata has a predictable format (https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON), and I am able to accomplish most of my goal by piping through a series of sed and tr commands, but that approach is clumsy and potentially error-prone, and I'd much prefer to be grepping a pretty-printed version.
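
For illustration, the kind of pipeline I mean looks roughly like this. The '},{"id":' boundary string is a guess at where one top-level entity ends and the next begins, so treat it as a sketch rather than a working recipe:

 # Hypothetical: split the one-line dump so each entity lands on its
 # own line, then grep the result. The line break inside the quotes is
 # intentional: BSD sed (macOS) wants a literal escaped newline in the
 # replacement, and the continuation line must start flush left so no
 # stray whitespace ends up in the output.
 sed -e 's/},{"id":/}\
{"id":/g' filename.json | grep '"site":"enwiki"'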

Any suggestions?

some ideas
  • By "certain kinds" you mean what? Can you give an example of the `grep`? It sounds like you need the white space from the pretty print to create your match criteria? Shouldn't you be able to use a regex to deal with having no white space? I'm guessing that since there is no white space, there are no `` characters, is that true? – Beartech Jul 28 '15 at 16:57
  • There are zero linefeeds, and forcing line feeds after JSON delimiters is too sloppy. The Wikidata format is very predictable, so, for example, I could search for lines containing '"language":"en"' or '"site":"enwiki"' and get what I need. – some ideas Jul 28 '15 at 19:07
  • Why not pipe through sed/awk (or whatever) to help your grep a bit? For example, cut the input at the start of each section that could potentially match your criteria. An option like `sed -u` does not buffer lines, which keeps memory use low. – NeronLeVelu Jul 29 '15 at 05:50
  • Hi Neron, thanks for mentioning sed -u; though that's not available on the Mac. But I did end up using a series of sed's. – some ideas Jul 30 '15 at 16:10

2 Answers


There are several libraries out there for parsing JSON streams, which I think is what you want—you can pipe the JSON in and deal with it as a stream, which saves you from having to load the whole thing into memory.

Oboe.js looks like a particularly mature project, and the docs are very good. See the "Reading from Node.js streams" and "Loading JSON trees larger than the available RAM" sections on this page: http://oboejs.com/examples

If you'd rather use Ruby, take a look at yajl-ruby. The API isn't quite as simple as Oboe.js's, but it ought to work for you.

Jordan Running
  • Thanks. I don't have Node installed on my Mac. Do you know offhand how to write that Ruby as a command-line one-liner? I believe I was able to install yajl-ruby 1.2.1 on my Mac with `sudo gem install yajl-ruby` – some ideas Jul 28 '15 at 19:16

You could try this; it looks like it lets you just pipe in your JSON file, and it will output a grep-friendly file...

json-liner
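
If it works as described, usage would presumably be something like the following. The npm install step and the exact invocation are assumptions on my part rather than anything from its docs, so check the project's README:

 # Assumption: json-liner is an npm package installable under this name.
 npm install -g json-liner
 # Per the description above, it reads piped JSON and emits one
 # grep-friendly line per datum; the output format is its own.
 json-liner < filename.json | grep enwiki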

Beartech