
I've been trying to parse some HTML from my C++ code. I've tried RapidXML, TinyXML and Xerces. The first two gave me parsing errors (the code I'm trying to parse is broken: some <> aren't closed), while Xerces returned null when I called getDocumentRoot().

How should I proceed in these cases, when I have to parse broken code? Are there libraries for this kind of problem?

Mat
  • It would help if you posted any relevant code, whether it's c++ or XML. – pg1989 Mar 25 '12 at 17:50
  • Maybe try using an HTML parser instead? – Mat Mar 25 '12 at 17:52
  • How to proceed depends on what you want to happen. So the XML is invalid. How do you want to fix it? You can't expect the parser to fix it, as that would require too many assumptions, so you need to specify what should happen when you find a broken document. – Martin York Mar 25 '12 at 18:48
  • HTML is not XML. Never has been, never will be. You will never get an off-the-shelf XML parser to correctly parse HTML. XHTML, on the other hand, does conform to the XML standard and can be parsed by any semi-decent XML parser. –  Mar 25 '12 at 18:49
  • Well, in the end I want to parse a file, modify some attributes and content, and save it in another file. I've just tried to parse it with htmlcxx. I get no error while parsing it, but I'm not able to save it back to a file. –  Mar 25 '12 at 22:07
  • Well, it's not C++, rather Python, but [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) was _made_ for this. – cha0site Mar 26 '12 at 09:38
  • I just tried it, and it seems to work for what I need. But I don't have a Python environment where the script needs to run. Isn't there a C++ library similar to Beautiful Soup? –  Mar 26 '12 at 13:08

3 Answers


Xerces-C, like many other parsers, reports errors by throwing exceptions.

If you want a robust XML parser, make heavy use of catching the thrown exceptions. Many exception classes carry additional information, so you can use them to build a really robust and "tolerant" XML parser.

SAX is also a good starting point.

Example DOM parser in xerces-c (my favorite parser):

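#include <xercesc/dom/DOM.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/sax/HandlerBase.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLException.hpp>

using namespace xercesc;

int main()
{
    // Xerces-C must be initialized before any other Xerces call.
    XMLPlatformUtils::Initialize();

    XercesDOMParser* parser = new XercesDOMParser();
    parser->setValidationScheme(XercesDOMParser::Val_Always);
    parser->setDoNamespaces(true);

    // HandlerBase derives from ErrorHandler, so no cast is needed.
    ErrorHandler* errHandler = new HandlerBase();
    parser->setErrorHandler(errHandler);

    const char* xmlFile = "test.xml";

    try
    {
        parser->parse(xmlFile);
    }
    catch (const XMLException& toCatch)
    {
        /* ERROR HANDLER */
    }
    catch (const DOMException& toCatch)
    {
        /* ERROR HANDLER */
    }
    catch (...)
    {
        /* ERROR HANDLER: other exception types, e.g. SAXParseException */
    }

    // Release parser objects before shutting Xerces-C down.
    delete parser;
    delete errHandler;
    XMLPlatformUtils::Terminate();
    return 0;
}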

Additionally, you can also create your own DOMErrorHandler to make "corrections" on the fly. See the xerces-c programming guide for more information.
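
To make the idea of "corrections" on the fly more concrete, here is a minimal, library-independent sketch. It is not the Xerces-C API: `Tag`, `RepairLog` and `repair` are hypothetical names invented for illustration, and the repair policy (auto-close unclosed elements, drop stray end tags) is only one possible assumption about what "fixing" should mean.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical types for this sketch; none of this is Xerces-C API.
struct Tag {
    bool closing;       // true for </name>, false for <name>
    std::string name;
};

// Plays the role of an error handler: it records problems instead of aborting.
struct RepairLog {
    std::vector<std::string> problems;
    void report(const std::string& msg) { problems.push_back(msg); }
};

// Returns a balanced tag sequence and logs every correction that was made.
std::vector<Tag> repair(const std::vector<Tag>& in, RepairLog& log) {
    std::vector<Tag> out;
    std::vector<std::string> open;  // stack of currently open element names

    for (const Tag& t : in) {
        if (!t.closing) {
            open.push_back(t.name);
            out.push_back(t);
            continue;
        }
        // Assumed policy: an end tag with no matching open element is dropped.
        if (std::find(open.rbegin(), open.rend(), t.name) == open.rend()) {
            log.report("stray </" + t.name + "> dropped");
            continue;
        }
        // Close everything that was opened after the matching element.
        while (open.back() != t.name) {
            log.report("missing </" + open.back() + "> inserted");
            out.push_back(Tag{true, open.back()});
            open.pop_back();
        }
        out.push_back(t);
        open.pop_back();
    }
    // Anything still open at the end of input gets closed as well.
    while (!open.empty()) {
        log.report("missing </" + open.back() + "> inserted at end of input");
        out.push_back(Tag{true, open.back()});
        open.pop_back();
    }
    return out;
}
```

With a real parser, the same decisions would live inside whatever handler interface the library offers; the sketch only shows the bookkeeping such a handler would need.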

Kevin Bedell
pearcoding

Have you tried this one? I've found it to be one of the simplest and most efficient XML parsers for C++. Maybe it can help you solve your problem.

Jorge Leitao

First off, if the XML is broken (as HTML generally is), then using a DOM parser is definitely not the way to go. If you use an event-based parser such as SAX (expat, Xerces, etc.), you might have better luck.

Failing that, why not pull the HTML parser out of WebKit and hook into that? It will be very error tolerant and, if I remember correctly, it is event based, so that should not be too difficult.
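
To illustrate why an event-based approach copes better with broken input, here is a rough standard-library-only sketch of a scanner that fires start, end and text events and never aborts. It is not expat, Xerces or WebKit code; `MarkupEvents`, `scanTolerant` and `traceEvents` are hypothetical names, and treating a `<` that never reaches its `>` as plain text is just one assumed recovery policy. Attributes, comments, scripts and quoted `>` characters are deliberately ignored.

```cpp
#include <algorithm>
#include <functional>
#include <string>

// Hypothetical callback set, loosely modeled on SAX-style handlers.
struct MarkupEvents {
    std::function<void(const std::string&)> startTag;
    std::function<void(const std::string&)> endTag;
    std::function<void(const std::string&)> text;
};

// Scans left to right and fires events; broken tags never stop the scan.
void scanTolerant(const std::string& in, const MarkupEvents& ev) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::size_t lt = in.find('<', pos);
        if (lt == std::string::npos) {  // no more tags: the rest is text
            ev.text(in.substr(pos));
            return;
        }
        if (lt > pos) ev.text(in.substr(pos, lt - pos));

        const std::size_t gt = in.find('>', lt + 1);
        const std::size_t nextLt = in.find('<', lt + 1);
        // Unclosed: input ends, or another '<' appears, before any '>'.
        const bool unclosed = gt == std::string::npos || nextLt < gt;
        if (unclosed || gt == lt + 1) {
            // Assumed policy: pass the broken tag (or an empty "<>") through as text.
            const std::size_t end = unclosed ? std::min(nextLt, in.size()) : gt + 1;
            ev.text(in.substr(lt, end - lt));
            pos = end;
            continue;
        }

        const std::string body = in.substr(lt + 1, gt - lt - 1);
        if (body[0] == '/') {
            ev.endTag(body.substr(1));
        } else {
            // The tag name ends at the first whitespace or '/'; attributes are skipped.
            ev.startTag(body.substr(0, body.find_first_of(" \t\r\n/")));
        }
        pos = gt + 1;
    }
}

// Convenience for experimenting: joins all events into one string.
std::string traceEvents(const std::string& in) {
    std::string out;
    MarkupEvents ev;
    ev.startTag = [&out](const std::string& n) { out += "[start " + n + "]"; };
    ev.endTag = [&out](const std::string& n) { out += "[end " + n + "]"; };
    ev.text = [&out](const std::string& t) { out += "[text " + t + "]"; };
    scanTolerant(in, ev);
    return out;
}
```

Because each event is handled as it arrives, a handler can decide locally how to treat a broken fragment, whereas a DOM parser has to reject the whole document when it cannot build a consistent tree.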

doron