What XML parser should I use in C++?

Question

I have XML documents that I need to parse and/or I need to build XML documents and write them to text (either files or memory). Since the C++ standard library does not have a library for this, what should I use?

Note: This is intended to be a definitive, C++-FAQ-style question for this. So yes, it is a duplicate of others. I did not simply appropriate those other questions because they tended to ask for something slightly more specific. This question is more generic.

I like tiCpp http://code.google.com/p/ticpp/, the docs aren't great (yet?), but I love the library, nice clean code. — , Feb 22 '12 at 00:59
Why is this question closed? Nobody has talked about XSD validation yet! If you need XSD validation, you should definitely use Xerces or libxml2, because the other parsers (pugixml, RapidXml, TinyXML and Expat) don't have any. Personally, I would advise using Xerces for XSD. My brief experience with Xerces and libxml2 suggested that Xerces was (way) faster. Has anybody come to the same conclusion? — Tesla123, Mar 07 '23 at 13:47
I really enjoy that I spend half my time here reading questions that have been closed because they ask for recommendations and produce a list. I typically find such posts extremely useful, and, like this one, well stocked with "facts and citations" (despite being closed). I continue to believe that the rule against such questions is misconceived, resulting in many clean and beautiful babies being tossed out. My gratitude to the SO community for all the care and effort put into answering such questions. — Spike0xff, May 16 '23 at 15:17

score 752 · Answer 1 · edited Jun 20 '20 at 09:12

Just like with standard library containers, what library you should use depends on your needs. Here's a convenient flowchart:

[Image: flowchart for choosing an XML library]

So the first question is this: What do you need?

I Need Full XML Compliance

OK, so you need to process XML. Not toy XML, real XML. You need to be able to read and write all of the XML specification, not just the low-lying, easy-to-parse bits. You need Namespaces, DocTypes, entity substitution, the works. The W3C XML Specification, in its entirety.

The next question is: Does your API need to conform to DOM or SAX?

I Need Exact DOM and/or SAX Conformance

OK, so you really need the API to be DOM and/or SAX. It can't just be a SAX-style push parser, or a DOM-style retained parser. It must be the actual DOM or the actual SAX, to the extent that C++ allows.

You have chosen:

Xerces

That's your choice. It's pretty much the only C++ XML parser/writer that has full (or as near as C++ allows) DOM and SAX conformance. It also has XInclude support, XML Schema support, and a plethora of other features.

It has no real dependencies. It uses the Apache license.

I Don't Care About DOM and/or SAX Conformance

You have chosen:

LibXML2

LibXML2 offers a C-style interface (if that really bothers you, go use Xerces), though the interface is at least somewhat object-based and easily wrapped. It provides a lot of features, like XInclude support (with callbacks so that you can tell it where it gets the file from), an XPath 1.0 recognizer, RelaxNG and Schematron support (though the error messages leave a lot to be desired), and so forth.

It does have a dependency on iconv, but it can be configured without that dependency, though that does mean it can parse a more limited set of text encodings.

It uses the MIT license.

I Do Not Need Full XML Compliance

OK, so full XML compliance doesn't matter to you. Your XML documents are either fully under your control or are guaranteed to use the "basic subset" of XML: no namespaces, entities, etc.

So what does matter to you? The next question is: What is the most important thing to you in your XML work?

Maximum XML Parsing Performance

Your application needs to take XML and turn it into C++ data structures as fast as this conversion can possibly happen.

You have chosen:

RapidXML

This XML parser is exactly what it says on the tin: rapid XML. It doesn't even deal with pulling the file into memory; how that happens is up to you. What it does deal with is parsing that into a series of C++ data structures that you can access. And it does this about as fast as it takes to scan the file byte by byte.

Of course, there's no such thing as a free lunch. Like most XML parsers that don't care about the XML specification, RapidXML doesn't touch namespaces, DocTypes, entities (with the exception of character entities and the five predefined XML ones), and so forth. What remains is basically nodes, elements, attributes, and such.
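To make the entity handling concrete: XML predefines five entities (`lt`, `gt`, `amp`, `apos`, `quot`), and character references look like `&#65;` or `&#x41;`. The following is a minimal standard-library sketch of decoding just those; the function name `decode_basic_entities` is hypothetical, this is not RapidXML's code or API, and for brevity it only handles character references in the ASCII range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Toy illustration only: NOT RapidXML's API or implementation.
// Expands the five predefined XML entities and numeric character references
// in the ASCII range. Anything else (for example a DTD-defined entity such
// as "&company;") is passed through unchanged.
std::string decode_basic_entities(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t semi =
            (in[i] == '&') ? in.find(';', i) : std::string::npos;
        if (semi == std::string::npos) {  // ordinary character or unterminated '&'
            out += in[i++];
            continue;
        }
        const std::string name = in.substr(i + 1, semi - i - 1);
        char decoded = 0;
        if (name == "lt") decoded = '<';
        else if (name == "gt") decoded = '>';
        else if (name == "amp") decoded = '&';
        else if (name == "apos") decoded = '\'';
        else if (name == "quot") decoded = '"';
        else if (name.size() > 1 && name[0] == '#') {
            const bool hex = (name[1] == 'x');
            const char* digits = name.c_str() + (hex ? 2 : 1);
            char* end = nullptr;
            const long code = std::strtol(digits, &end, hex ? 16 : 10);
            if (end != digits && *end == '\0' && code > 0 && code < 128)
                decoded = static_cast<char>(code);
        }
        if (decoded != 0) {
            out += decoded;
            i = semi + 1;
        } else {
            out += in[i++];  // unknown entity: keep the text as written
        }
    }
    return out;
}
```

A parser that stops here never consults a DTD, which is why custom entities come through as literal text.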

Also, it is a DOM-style parser. So it does require that you read all of the text in. However, what it doesn't do is copy any of that text (usually). The way RapidXML gets most of its speed is by referring to strings in-place. This requires more memory management on your part (you must keep that string alive while RapidXML is looking at it).
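The lifetime requirement is easier to see in a small sketch. This one uses only the standard library (C++17 `std::string_view`) and a hypothetical function `first_tag_name`; it is not RapidXML's API, just an illustration of returning a reference into the caller's buffer instead of a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Toy illustration only: NOT RapidXML's API.
// Returns the name of the first start tag as a view into `buffer`.
// No characters are copied, so the view is valid only while the underlying
// buffer is alive and unmodified.
std::string_view first_tag_name(std::string_view buffer) {
    const std::size_t open = buffer.find('<');
    if (open == std::string_view::npos) return {};
    const std::size_t begin = open + 1;
    const std::size_t end = buffer.find_first_of(" \t\r\n/>", begin);
    if (end == std::string_view::npos) return {};
    return buffer.substr(begin, end - begin);
}
```

The returned view points at the same bytes as the input, so destroying or reallocating the input string while the view is still in use would leave it dangling. That is the trade-off being described: no copies, but the caller owns the lifetime problem.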

RapidXML's DOM is bare-bones. You can get string values for things. You can search for attributes by name. That's about it. There are no convenience functions to turn attributes into other values (numbers, dates, etc). You just get strings.

One other downside with RapidXML is that it is painful for writing XML. It requires you to do a lot of explicit memory allocation of string names in order to build its DOM. It does provide a kind of string buffer, but that still requires a lot of explicit work on your end. It's certainly functional, but it's a pain to use.

It uses the MIT license. It is a header-only library with no dependencies.

There is a RapidXML "GitHub patch" that allows it to also work with namespaces.

I Care About Performance But Not Quite That Much

Yes, performance matters to you. But maybe you need something a bit less bare-bones. Maybe something that can handle more Unicode, or doesn't require so much user-controlled memory management. Performance is still important, but you want something a little less direct.

You have chosen:

PugiXML

Historically, this served as inspiration for RapidXML. But the two projects have diverged, with Pugi offering more features, while RapidXML is focused entirely on speed.

PugiXML offers Unicode conversion support, so if you have some UTF-16 docs around and want to read them as UTF-8, Pugi will provide. It even has an XPath 1.0 implementation, if you need that sort of thing.
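For readers unsure what such a conversion involves, here is a rough standard-library sketch of turning UTF-16 code units into UTF-8. The function name `utf16_to_utf8` is hypothetical and this is not pugixml's implementation or API; it only shows the kind of work the library takes off your hands.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy illustration only: NOT pugixml's API or implementation.
// Converts UTF-16 code units to UTF-8, replacing unpaired surrogates
// with U+FFFD (the Unicode replacement character).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```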

But Pugi is still quite fast. Like RapidXML, it has no dependencies and is distributed under the MIT License.

Reading Huge Documents

You need to read documents that are measured in the gigabytes in size. Maybe you're getting them from stdin, being fed by some other process. Or you're reading them from massive files. Or whatever. The point is, what you need is to not have to read the entire file into memory all at once in order to process it.

You have chosen:

LibXML2

Xerces's SAX-style API will work in this capacity, but LibXML2 is here because it's a bit easier to work with. A SAX-style API is a push-API: it starts parsing a stream and just fires off events that you have to catch. You are forced to manage context, state, and so forth. Code that reads a SAX-style API is a lot more spread out than one might hope.

LibXML2's xmlReader object is a pull-API. You ask to go to the next XML node or element; you aren't told. This allows you to store context as you see fit, to handle different entities in a way that's much more readable in code than a bunch of callbacks.
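The difference between the two styles can be sketched with a toy tokenizer. Everything here (`PullReader`, `push_parse`, `count_items_pull`) is hypothetical and standard-library only; it is neither LibXML2's nor Xerces's API, and it handles only a drastically simplified input. The point is who drives the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Toy illustration only: neither class is LibXML2's or Xerces's API.
// Both styles scan the same simplified input and report start-tag names.

// Pull style: the caller asks for the next item when it is ready for it.
class PullReader {
public:
    explicit PullReader(std::string text) : text_(std::move(text)) {}

    // Advances to the next start tag; returns false at end of input.
    bool next(std::string& name) {
        while (pos_ < text_.size()) {
            const std::size_t open = text_.find('<', pos_);
            if (open == std::string::npos) break;
            const std::size_t close = text_.find('>', open);
            if (close == std::string::npos) break;
            pos_ = close + 1;
            if (text_[open + 1] == '/') continue;  // skip end tags
            name = text_.substr(open + 1, close - open - 1);
            return true;
        }
        pos_ = text_.size();
        return false;
    }

private:
    std::string text_;
    std::size_t pos_ = 0;
};

// Push style: the parser drives the loop and calls back into user code,
// so any state the handler needs must live outside the handler.
void push_parse(const std::string& text,
                const std::function<void(const std::string&)>& on_start) {
    PullReader reader(text);
    std::string name;
    while (reader.next(name)) on_start(name);
}

// With the pull reader, state lives in ordinary local variables.
std::size_t count_items_pull(const std::string& text) {
    PullReader reader(text);
    std::string name;
    std::size_t count = 0;
    while (reader.next(name))
        if (name == "item") ++count;
    return count;
}
```

With the pull version, the counting logic reads top to bottom in one function. With the push version, the same logic would be split between a callback and whatever object holds the running count.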

Alternatives

Expat

Expat is a well-known C parser with a callback-driven (SAX-style push) API. It was written by James Clark.

Its current status is active. The most recent version is 2.2.9, which was released on 2019-09-25.

LlamaXML

It is an implementation of a StAX-style API. It is a pull-parser, similar to LibXML2's xmlReader parser.

But it hasn't been updated since 2005. So again, Caveat Emptor.

XPath Support

XPath is a system for querying elements within an XML tree. It's a handy way of effectively naming an element or collection of elements by common properties, using a standardized syntax. Many XML libraries offer XPath support.

There are effectively three choices here:

LibXML2: It provides full XPath 1.0 support. Again, it is a C API, so if that bothers you, there are alternatives.
PugiXML: It comes with XPath 1.0 support as well. As above, it's more of a C++ API than LibXML2, so you may be more comfortable with it.
TinyXML: It does not come with XPath support, but there is the TinyXPath library that provides it. TinyXML is undergoing a conversion to version 2.0, which significantly changes the API, so TinyXPath may not work with the new API. Like TinyXML itself, TinyXPath is distributed under the zLib license.
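To give a feel for what a path query does, here is a toy evaluator for a tiny subset of XPath-like syntax: slash-separated child steps with optional `*` wildcards, such as `book/title`. The `Node` struct and `select_path` function are hypothetical and standard-library only; real XPath 1.0, as implemented by the libraries above, also has axes, predicates, functions and much more.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy illustration only: a tiny subset of what XPath 1.0 offers, and
// NOT the API of LibXML2, PugiXML or TinyXPath.
struct Node {
    std::string name;
    std::string text;
    std::vector<Node> children;
};

// Evaluates a relative path of child steps (for example "book/title")
// starting at `context`. A step of "*" matches any child element.
std::vector<const Node*> select_path(const Node& context, const std::string& path) {
    std::vector<const Node*> current{&context};
    std::istringstream steps(path);
    std::string step;
    while (std::getline(steps, step, '/')) {
        std::vector<const Node*> next;
        for (const Node* node : current)
            for (const Node& child : node->children)
                if (step == "*" || child.name == step) next.push_back(&child);
        current.swap(next);
    }
    return current;
}
```

Each step narrows the current node set to matching children, which is the core idea behind a location path; a full implementation generalizes the step to other axes and filters.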

Just Get The Job Done

So, you don't care about XML correctness. Performance isn't an issue for you. Streaming is irrelevant. All you want is something that gets XML into memory and allows you to stick it back onto disk again. What you care about is API.

You want an XML parser that's going to be small, easy to install, trivial to use, and small enough to be irrelevant to your eventual executable's size.

You have chosen:

TinyXML

I put TinyXML in this slot because it is about as braindead simple to use as XML parsers get. Yes, it's slow, but it's simple and obvious. It has a lot of convenience functions for converting attributes and so forth.

Writing XML is no problem in TinyXML. You just new up some objects, attach them together, send the document to a std::ostream, and everyone's happy.
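The build-then-stream workflow looks roughly like the sketch below. The `Element` class and `to_string` helper are hypothetical and standard-library only; this is not TinyXML's API (TinyXML has its own document, element and text classes), just the general shape of creating nodes, attaching them, and writing to a `std::ostream`.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy illustration only: NOT TinyXML's API.
class Element {
public:
    explicit Element(std::string name) : name_(std::move(name)) {}

    Element& set_attribute(const std::string& key, const std::string& value) {
        attributes_.emplace_back(key, value);
        return *this;
    }
    Element& set_text(std::string text) {
        text_ = std::move(text);
        return *this;
    }

    // Takes ownership of the child and returns a reference to it.
    Element& append(std::unique_ptr<Element> child) {
        children_.push_back(std::move(child));
        return *children_.back();
    }

    void write(std::ostream& os) const {
        os << '<' << name_;
        for (const auto& a : attributes_)
            os << ' ' << a.first << "=\"" << escape(a.second) << '"';
        if (text_.empty() && children_.empty()) {
            os << "/>";
            return;
        }
        os << '>' << escape(text_);
        for (const auto& c : children_) c->write(os);
        os << "</" << name_ << '>';
    }

private:
    // Escapes the characters that are unsafe in text and attribute values.
    static std::string escape(const std::string& raw) {
        std::string out;
        for (char ch : raw) {
            switch (ch) {
                case '<': out += "&lt;"; break;
                case '>': out += "&gt;"; break;
                case '&': out += "&amp;"; break;
                case '"': out += "&quot;"; break;
                default:  out += ch;
            }
        }
        return out;
    }

    std::string name_;
    std::string text_;
    std::vector<std::pair<std::string, std::string>> attributes_;
    std::vector<std::unique_ptr<Element>> children_;
};

std::ostream& operator<<(std::ostream& os, const Element& e) {
    e.write(os);
    return os;
}

// Convenience wrapper: serialize to a string instead of a stream.
std::string to_string(const Element& e) {
    std::ostringstream os;
    os << e;
    return os.str();
}
```

Because the tree owns its children, the caller never has to manage string or node storage by hand, which is the contrast with the RapidXML writing experience described earlier.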

There is also something of an ecosystem built around TinyXML, with a more iterator-friendly API, and even an XPath 1.0 implementation layered on top of it.

TinyXML uses the zLib license, which is more or less the MIT License with a different name.

This looks a bit like a copy-paste. Can you link the source document? — Joel, Feb 22 '12 at 00:49
@Joel: quite often when someone answers their own question with a good long post, it is because they are following in the spirit of [Jeff's advice](http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/) -- especially because what looks like a so-so question can often be closed before a good answer can be posted, if the person is writing the answer right then and there. By taking some time to prepare a response before he asked the question :) Nicol is providing us _all_ with an excellent candidate for Close->Duplicate questions in the future. — sarnold, Feb 22 '12 at 00:52
@Nicol: I'd love to see a small mention of benefits of DOM vs SAX styles. — sarnold, Feb 22 '12 at 00:58
@Joel: I'm afraid I can't. It was just a temporary document I copied from in Notepad++. I never saved it, so I can't link you to it ;) — Nicol Bolas, Feb 22 '12 at 01:11
Just wanted to call out my own expatpp model for making SAX style parsers a lot easier to use by writing nesting parsers. Whilst it's written to the expat interface, the model can be easily adapted to other event-driven parsers. I'm told that Valve used it in Steam (I only knew when I read about this in DDJ and the book "C++ XML" - I guess a sign of a good library is not being told ;-). Currently hosted on Sourceforge www.expatpp.com. Yeah it's old but that's because I was solving problems in XML back in '97. — Andy Dent, Apr 11 '13 at 05:25
Might be worth mentioning newer version of TinyXML: *TinyXML-2 uses a similar API to TinyXML-1 and the same rich test cases. But the implementation of the parser is completely re-written to make it more appropriate for use in a game. It uses less memory, is faster, and uses far few memory allocations.* — johnbakers, Apr 24 '13 at 01:00
You forgot to mention vtd-xml... it is conformant yet offers high performance and low memory usage (better than pugi and RapidXML), and it supports XPath. — vtd-xml-author, Jul 07 '13 at 03:43
I find the [XmlTextReader](http://xmlsoft.org/xmlreader.html) interface that libxml2 offers very convenient, and might I say so the performance also seem excellent. — Martin Ba, Oct 28 '13 at 21:08
I like this question and answer, but find it too Unix-biased. No mention of MSXML and XmlLite? If multi-platform portability is your reason for excluding those, then this should be clearly mentioned in the question and answer. (Otherwise some people might end up choosing e.g. Libxml2 for a Windows-only project, which is asking for headaches that could have easily been avoided.) — Scrontch, Dec 19 '13 at 14:01
@NicolBolas As OpenLearner mentioned there is [TinyXML-2](http://www.grinninglizard.com/tinyxml2docs/index.html) (which has been in development [since December 2011](https://github.com/leethomason/tinyxml2/commit/e13c3e653d3887f0a736d5da36bc367cac69755a)). Is there any possibility it could be included as a part of this answer? I'm not sure how it ranks, but that's what brought me here. — monkey0506, Mar 19 '14 at 23:55
@Joel Pandoc magic: http://downloads.sehe.nl/stackoverflow/q9387610.odt — sehe, Apr 21 '15 at 15:05
@NicolBolas, regarding the c-interface of libxml2, there's a wrapper in C++ called libxml++ and it's mentioned in libxml2 faq(10). That means same functionality of libxml2 and C++ interface. I used it under Linux a lot of years ago. — AndrewBloom, Jun 20 '15 at 07:54
Personally, I chose RapidXML because it's header-only and could be ported to Android NDK without changing anything! — John Hany, Mar 31 '16 at 07:30
Just want to confirm RapidXML is outperforming the fastest streaming parsers (expat, libxml2) in my tests by a factor of 3! — rustyx, Jun 10 '16 at 11:04
A little update: the latest benchmarks on the website of pugixml shows that pugixml now outperforms rapidxml in parsing time and memory usage. http://pugixml.org/benchmark.html — plasmacel, Nov 01 '16 at 02:10
About performance: according to pugixml's benchmark, its performance is better even than RapidXML's. So I guess things might have changed in the past 6 years (pugixml is now at 1.8, released in 2016, while RapidXML hasn't had an update since 2009). link: https://pugixml.org/benchmark.html — Ezra Steinmetz, Feb 20 '18 at 17:40
Adding libhpxml to the list: a stream reader with a very simple and effective API. It handles OpenStreetMap files that can reach terabytes in size. https://www.abenteuerland.at/libhpxml/. A library that reads OSM data using libhpxml: https://github.com/pedro-vicente/lib_osm — Pedro Vicente, May 06 '19 at 05:02
An issue I have with the C non-reentrant approach of libxml is that it is not state-less. — Ben, Oct 10 '19 at 08:09
I think it is worth mentioning that Boost property tree has XML support (it's where RapidXML comes from in the first place). This is extremely easy to overlook for the untrained eye, and some people may already have Boost set up and not be aware of it (e.g. sometimes your needs are pretty basic and using a tool you already have is preferred). — darune, Dec 02 '19 at 08:48
Also, when feasible, you probably want to use the property_tree library (the high-level API) instead of RapidXML. So it should be property_tree/RapidXML, or boost property_tree should have an entry of its own in your list — darune, Dec 02 '19 at 10:14
any view on POCO xml, https://pocoproject.org/slides/170-XML.pdf? — TooTone, Sep 18 '20 at 10:57

score 21 · Answer 2 · edited Mar 25 '13 at 23:14

There is another approach to handling XML that you may want to consider, called XML data binding, especially if you already have a formal specification of your XML vocabulary, for example in XML Schema.

XML data binding allows you to use XML without actually doing any XML parsing or serialization. A data binding compiler auto-generates all the low-level code and presents the parsed data as C++ classes that correspond to your application domain. You then work with this data by calling functions, and working with C++ types (int, double, etc) instead of comparing strings and parsing text (which is what you do with low-level XML access APIs such as DOM or SAX).
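As a rough picture of what that means in practice, the sketch below hand-writes the kind of class a binding compiler might produce for a hypothetical schema declaring a person with a string `name` and an integer `age`. Everything here (`RawElement`, `Person`, `parse_person`) is an assumption made for illustration; it is not CodeSynthesis XSD output, and a `std::map` stands in for low-level DOM access.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy illustration only: a hand-written stand-in for generated code,
// NOT the output or API of any real data binding compiler.

// Stand-in for low-level access: every value is a string keyed by name.
using RawElement = std::map<std::string, std::string>;

// The kind of class a binding compiler could derive from a schema that
// declares a person with a string "name" and an integer "age".
class Person {
public:
    Person(std::string name, int age) : name_(std::move(name)), age_(age) {}
    const std::string& name() const { return name_; }
    int age() const { return age_; }

private:
    std::string name_;
    int age_;
};

// The generated parsing code does the string handling once, in one place.
Person parse_person(const RawElement& raw) {
    const auto name = raw.find("name");
    const auto age = raw.find("age");
    if (name == raw.end() || age == raw.end())
        throw std::invalid_argument("person requires name and age");
    return Person(name->second, std::stoi(age->second));
}
```

Application code then works with `person.age()` as an `int` rather than looking up an `"age"` string and converting it at every use site.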

See, for example, an open-source XML data binding implementation that I wrote, CodeSynthesis XSD and, for a lighter-weight, dependency-free version, CodeSynthesis XSD/e.

answered Feb 22 '12 at 13:41 by Boris Kolpackov · edited Mar 25 '13 at 23:14 by JBentley

I don't mind the post, but SO policy states that if you suggest something you wrote, you should mention that you wrote it, in the interest of full disclosure. – Nicol Bolas Mar 09 '12 at 17:10
@Nicol I edited it into the answer. – JBentley Mar 25 '13 at 23:18
Perhaps helpful is [this list](http://xmldatabinding.org) but I could not find out who the author(s) of that list are (without public disclosure I can't see if the descriptions and ratings are meaningful). Perhaps one can look at the [W3C data binding working group](https://www.w3.org/2002/ws/databinding/) that lists several **[data binding tools](http://www.w3.org/2002/ws/databinding/edcopy/toolkits)** which are in the public domain and were used for testing and reporting (full disclosure: I am not affiliated with CodeSynthesis, I've helped gsoap listed with the W3C tools). – Dr. Alex RE Feb 27 '16 at 18:17

Michael Haephrati · Answer 3 · 2017-01-10T15:48:51.907

At Secured Globe, Inc., we use rapidxml. We tried all the others, but rapidxml seems to be the best choice for us.

Here is an example:

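    // Assumes the surrounding function returns bool, `xmlData` is a malloc'd,
    // NUL-terminated char buffer, and `result` is a wchar_t[100].
    // GetNodeByElementName is a helper defined elsewhere (not part of RapidXML).
    rapidxml::xml_document<char> doc;
    doc.parse<0>(xmlData);  // parses in place and modifies xmlData
    rapidxml::xml_node<char>* root = doc.first_node();

    rapidxml::xml_node<char>* node_account = nullptr;
    if (root && GetNodeByElementName(root, "Account", &node_account))
    {
        rapidxml::xml_node<char>* node_default = nullptr;
        if (GetNodeByElementName(node_account, "default", &node_default))
        {
            // Copy the value out before freeing the buffer the node points into.
            swprintf(result, 100, L"%hs", node_default->value());
            free(xmlData);
            return true;
        }
    }
    free(xmlData);
    return false;  // assumed: the element was not found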

score 1 · Answer 4 · answered Jun 07 '17 at 19:30

One other note about Expat: it's worth looking at for embedded systems work. However, the documentation you are likely to find on the web is ancient and wrong. The source code actually has fairly thorough function-level comments, but it will take some perusing for them to make sense.

Victor Gubin · Answer 5 · 2021-09-14T19:35:49.157

OK then. I've created a new one, since none of the libraries on the list satisfied my needs.

Benefits:

Pull-parser streaming API: the parser works like an iterator, with no callbacks and no DOM tree, so XML is read directly into your data structures
Exceptions and RTTI can be turned off via compiler options; error handling can be done with std::error_code
Limited memory usage and support for large files (tested with a 100 MiB XMark file; speed depends on hardware). There is an example that loads 3D models from a limited subset of the COLLADA format
Unicode support, with auto-detection of the input source encoding

Project home

answered Mar 07 '18 at 16:30 · edited Sep 14 '21 at 19:35 · Victor Gubin

Could you add benchmarks? – Vadim Peretokin Apr 23 '18 at 08:01

score 0 · Answer 6 · answered Dec 24 '15 at 10:56

I'll add mine as well.

http://www.codeproject.com/Articles/998388/XMLplusplus-version-The-Cplusplus-update-of-my-XML

No XML validation features, but fast.

answered Dec 24 '15 at 10:56 · Michael Chourdakis

Is it faster or more widely used than RapidXML? Or PugiXML? The domain space for "fast, not-entirely-XML" C++ parser has been pretty well covered. – Nicol Bolas Jan 02 '16 at 13:33
