
The command

$ xmlstarlet sel -t -c "/collection/record" file.xml

seems to load the whole file into memory before applying the given XPath expression. This is not usable for large XML files.

Does xmlstarlet provide a streaming mode to extract subelements from a large (100G+) XML file?

miku
  • You might also consider a database system like BaseX or eXist that offers XQuery (a superset of XPath) on XML data. – Martin Honnen Nov 11 '15 at 15:48
  • @MartinHonnen Thanks, I'm a bit hesitant about introducing an extra component. In the end I only need to select parts of an XML file for later processing, no advanced queries. – miku Nov 11 '15 at 15:51
  • Using an XML database would be interesting if you are doing the same operations over and over again, and often. An XML database would save you the time needed to parse and search the parsed XML tree: parsing would be done only once during import, and one can define additional indices. That said, it's not easy or straightforward to select or tweak such a database, as XML was not designed for database purposes. – marbu Nov 11 '15 at 17:06

2 Answers


Since I only needed a tiny subset of XPath for large XML files, I actually implemented a little tool myself: xmlcutty.

The example from my question could be written like this:

$ xmlcutty -path /collection/record file.xml
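
Since the concatenated fragments alone do not form a well-formed document, xmlcutty can, if I read its README correctly, also wrap the output in a synthetic root element via a -root flag (run xmlcutty -h to confirm the exact option):

$ xmlcutty -path /collection/record -root records file.xml

Because the tool only tracks the current element path while copying matching bytes through, memory usage stays flat regardless of file size.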
miku

Xmlstarlet translates all (or most) operations into XSLT transformations, so the short answer is no.

You could try STX, which is a streaming transformation language similar to XSLT. On the other hand, just coding something together in Python using SAX or iterparse may be easier and faster (with respect to the time needed to create the code) if you don't care about XML that much. A minimal iterparse sketch follows below.
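
For illustration, here is a minimal sketch of the iterparse approach, assuming the /collection/record layout from the question; the file name and the print call are placeholders for whatever processing you need:

import xml.etree.ElementTree as ET

# Ask for start events too, so we can grab the root element
# and prune already-processed children from it.
context = ET.iterparse("file.xml", events=("start", "end"))
_, root = next(context)  # first event: start of <collection>

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # A complete <record> subtree is in memory at this point.
        print(ET.tostring(elem, encoding="unicode"))
        # Drop the subtree again so memory stays flat on 100G+ files.
        elem.clear()
        root.remove(elem)

Only one record is held in memory at a time; the clear/remove step at the end of each iteration is what keeps this workable on files far larger than RAM.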

marbu
  • Thanks, I thought so, and I have also thought about writing a small tool. I just hoped there was some tool I had missed. – miku Nov 11 '15 at 15:42
  • I think that the lack of generic XML streaming tools (it's just my personal guess, though) is caused by the sheer number of XML features and standards. The full feature set of most XML standards is impossible to implement in a streaming-friendly way. – marbu Nov 11 '15 at 17:00
  • Yes, probably. But even for lighter tasks like XML splitting there are only a few, relatively unknown tools, like "xml_split". It's a bit depressing. – miku Nov 11 '15 at 17:06
  • The Perl module XML::Twig (with which programs like xml_grep and xml_split are built) is able to handle very big files with comparatively little memory, and makes it reasonably easy to quickly write small programs. On Linux it is in the package perl-XML-Twig. – PBI Feb 03 '16 at 23:25