14

I have a collection of XML files, and some of them are pretty big (up to ~50 million element nodes). I am using xmllint for validating those files, which works pretty nicely even for the huge ones thanks to the streaming API.

xmllint --loaddtd --stream --valid /path/to/huge.xml

I recently learned that xmllint is also capable of doing command line XPath queries, which is very handy.

xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml

However, these XPath queries do not work for the huge XML files. I just receive a "Killed" message after some time. I tried to enable the streaming API, but this just leads to no output at all.

xmllint --loaddtd --stream --xpath '/root/a/b/c/text()' /path/to/huge.xml

Is there a way to enable streaming mode when doing XPath queries using xmllint? Are there other/better ways to do command line XPath queries for huge XML files?

MRA
  • try `--shell` option for interactive (with just the xml file path) – flafoux May 18 '15 at 14:42
  • I tried opening the interactive shell for a huge file, but it will crash ("Killed", just as in the case of not using `--stream`) before I can enter any command. – MRA May 18 '15 at 15:00
  • http://superuser.com/questions/543881/efficiently-extracting-a-few-data-from-a-large-xml-file – Ciro Santilli OurBigBook.com Oct 07 '15 at 12:57
  • 1
    attaching a sample XML file would help – I, for one, have no idea what **large** might mean in your case. – Eduard Sukharev Jan 30 '16 at 09:42
  • 1
    Think of something like the dblp XML dump (http://dblp.dagstuhl.de/xml/). I receive the "Killed" error when parsing that file in a non-streaming context. But my question is aimed at essentially any file that is big enough such that you would be ill advised to build a DOM in main memory and should rather use a streaming approach instead. – MRA Feb 01 '16 at 10:33
  • What about using [XSLT 3.0 streaming functions](http://www.stylusstudio.com/tutorials/intro-xslt-3.html) for that? It could be more predictable and safer. – Honza Hejzl Mar 18 '16 at 09:29
  • Internally, `libxml2` has some support for streaming XPath expressions, but `xmllint` (the command-line interface to `libxml2`) doesn't support the combination of `--xpath` and `--stream`. – nwellnhof Oct 28 '16 at 12:24

2 Answers

5

If your XPath expressions are very simple, try xmlcutty.

From the homepage:

xmlcutty is a simple tool for carving out elements from large XML files, fast. Since it works in a streaming fashion, it uses almost no memory and can process around 1G of XML per minute.
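To illustrate what "streaming" means for a simple path like `/root/a/b/c`, here is a minimal Python sketch built on the standard library's `xml.etree.ElementTree.iterparse`. It is not how xmlcutty or xmllint work internally; the function name `stream_text` is hypothetical, and it assumes a plain slash-separated element path with no predicates, wildcards, or namespaces. It yields only the text that precedes an element's first child, so it approximates `text()` rather than implementing XPath.

```python
import io
import xml.etree.ElementTree as ET


def stream_text(source, path):
    """Yield the leading text of every element whose absolute path is `path`.

    `source` is a filename or a binary file object; `path` is a simple
    path such as '/root/a/b/c' (assumption: element names only).
    """
    target = path.strip("/").split("/")
    open_elems = []  # currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            continue
        # "end" event: elem is complete and is the last entry of open_elems
        if [e.tag for e in open_elems] == target:
            yield elem.text or ""
        open_elems.pop()
        if open_elems:
            # Detach the finished subtree from its parent so memory use
            # stays roughly proportional to the nesting depth.
            open_elems[-1].remove(elem)


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<root><a><b><c>one</c><c>two</c></b><x><c>skip</c></x></a></root>"
    )
    for text in stream_text(sample, "/root/a/b/c"):
        print(text)
```

Because each finished element is detached from its parent right away, no full DOM is ever built, which is the property that keeps memory use low on very large inputs. For a real file, pass its path instead of the in-memory sample.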

gioele
  • 1
    A command like `xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml` would be translated into `xmlcutty -path '/root/a/b/c' -rename '\n' /path/to/small.xml`, where the *rename* option renames the last enclosing element and thus simulates a `text()`. The syntax is a bit arcane. – miku Sep 14 '17 at 06:18
-1

Changing the `ulimit` settings might work. Try this:

$ ulimit -Sv 500000
$ xmllint (...your command)
Leonardo Alves Machado
ajslaghu