
I'm working with Wikidata (a cross-referencing of multiple data sources, including Wikipedia), which is distributed as a ~50 GB JSON file with no white space. I want to extract certain kinds of data from it, which I could do with grep if it were pretty-printed. I'm running on a Mac.

Some methods of reformatting, e.g.,

 cat ... | python -m json.tool
 ./jq . filename.json

will not work on a large file: Python chokes, and jq dies. There was a great thread here: How can I pretty-print JSON in a (unix) shell script? But I'm not sure how, or whether, any of those approaches can deal with files this large.
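
One caveat on jq: newer releases (1.5+) add a `--stream` mode that parses input incrementally instead of buffering the whole document, so it might survive a file this size. A minimal sketch, assuming a new-enough jq and that the dump is a single top-level JSON array:

 # Requires jq 1.5+. --stream emits [path,value] events instead of
 # building the whole document in memory; fromstream/truncate_stream
 # then reassemble each top-level array element, and -c prints each
 # one as a compact JSON object on its own line, which is grep-friendly.
 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' filename.json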

This company uses "Akka streams" to do this very task (they claim less than 10 minutes to process all of Wikidata), but I know nothing about it: http://engineering.intenthq.com/2015/06/wikidata-akka-streams/

Wikidata has a predictable format (https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON), and I am able to accomplish most of my goal by piping through a series of sed and tr commands, but that approach is clumsy and potentially error-prone, and I'd much prefer to be grepping a pretty-printed version.
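
For illustration, the kind of pipeline I mean looks roughly like this. The '},{"id":' boundary string is a guess at where one top-level entity ends and the next begins, so treat it as a sketch rather than a working recipe:

 # Hypothetical: split the one-line dump so each entity lands on its
 # own line, then grep the result. The line break inside the quotes is
 # intentional: BSD sed (macOS) wants a literal escaped newline in the
 # replacement, and the continuation line must start flush left so no
 # stray whitespace ends up in the output.
 sed -e 's/},{"id":/}\
{"id":/g' filename.json | grep '"site":"enwiki"'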

Any suggestions?

some ideas
  • By "certain kinds" you mean what? Can you give an example of the `grep`? It sounds like you need the white space from the pretty print to create your match criteria? Shouldn't you be able to use a regex to deal with having no white space? I'm guessing that since there is no white space, there are no `` characters, is that true? – Beartech Jul 28 '15 at 16:57
  • There are zero linefeeds, and forcing line feeds after JSON delimiters is too sloppy. The Wikidata format is very predictable, so, for example, I could search for lines containing '"language":"en"' or '"site":"enwiki"' and get what I need. – some ideas Jul 28 '15 at 19:07
  • Why not pipe through sed/awk (or whatever) to help your grep a bit? For example, cut the input at the start of each section that could potentially match your criteria. An option like `sed -u` does not buffer lines, which keeps memory use low. – NeronLeVelu Jul 29 '15 at 05:50
  • Hi Neron, thanks for mentioning sed -u; though that's not available on the Mac. But I did end up using a series of sed's. – some ideas Jul 30 '15 at 16:10

2 Answers


There are several libraries out there for parsing JSON streams, which I think is what you want—you can pipe the JSON in and deal with it as a stream, which saves you from having to load the whole thing into memory.

Oboe.js looks like a particularly mature project, and the docs are very good. See the "Reading from Node.js streams" and "Loading JSON trees larger than the available RAM" sections on this page: http://oboejs.com/examples

If you'd rather use Ruby, take a look at yajl-ruby. The API isn't quite as simple as Oboe.js's, but it ought to work for you.

Jordan Running
  • Thanks. I don't have Node installed on my Mac. Do you know offhand how to write that Ruby as a command-line one-liner? I believe I was able to install yajl-ruby 1.2.1 on my Mac with `sudo gem install yajl-ruby` – some ideas Jul 28 '15 at 19:16

You could try this; it looks like it lets you just pipe in your JSON file, and it will output a grep-friendly file...

json-liner
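
If it works as described, usage would presumably be something like the following. The npm install step and the exact invocation are assumptions on my part rather than anything from its docs, so check the project's README:

 # Assumption: json-liner is an npm package installable under this name.
 npm install -g json-liner
 # Per the description above, it reads piped JSON and emits one
 # grep-friendly line per datum; the output format is its own.
 json-liner < filename.json | grep enwiki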

Beartech