19

I have been writing code in Java to extract data from some web pages, and Jsoup was one of the best libraries to work with. Unfortunately, I now have to port the whole code to C/C++, and I cannot find any decent HTML parser for C++. Is there a Jsoup-like library for C++, or how can similar results be achieved?

[Currently I am using Curl to get the source of the pages and searching the internet for an HTML parser]

Writwick
  • There are [really good XML parsers](http://stackoverflow.com/questions/170686) out there, but I am not aware of a good C++ HTML specific parser – nikolas Jul 29 '13 at 10:46
  • Would JNI be a solution for you? – suspectus Jul 29 '13 at 10:50
  • I might not want to use JNI. I have not much idea about it. And also I want to make the project less dependent[except necessary]. – Writwick Jul 29 '13 at 12:20
  • And also for a clarification, what I need is just parsing the document and get some values from it and a reliable method to return using CSS Selector[preferably] or Xpath. Also, it would be very good if the parser is very fast, since I would be browsing over 100,000 pages to maintain a database. – Writwick Jul 29 '13 at 12:24
  • I don't know how things compare to Jsoup, but see [Comparison of HTML parsers](http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers) for a detailed list of parsers in various languages. – jww Mar 11 '14 at 06:26
  • Google open sourced Gumbo: https://github.com/google/gumbo-parser – CC. Jun 05 '14 at 22:01

6 Answers

13

Unfortunately, I guess there's no parser like Jsoup for C++ ...

Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries

I used TinyXML-2 for (HTML) DOM parsing; it's a very small library (only 2 files) that runs on most operating systems (even non-desktop ones).

LibXml

  • push and pull parser (DOM, SAX)
  • Validation
  • XPath and XPointer support
  • Cross-platform / good documentation

Apache Xerces

  • push and pull parser (DOM, SAX)
  • Validation
  • No XPath support (but a package for this?)
  • Cross-platform / good documentation

If you are on C++/CLI, check out NSoup, a Jsoup port for .NET.

Some more:

Maybe you can combine a DOM model / parser with a separate CSS selector library?
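To make the "DOM plus CSS selector" idea concrete, here is a minimal sketch of matching a small selector subset (tag, `#id`, `.class`, and the descendant combinator) against a tree. Everything in it is hypothetical: `Node`, `make_node` and `select_all` are illustration names, not the API of any library mentioned in this thread, and a real parser would build the tree from the downloaded page.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real HTML parser would produce this tree.
struct Node {
    std::string tag, id, text;
    std::vector<std::string> classes;
    std::vector<std::unique_ptr<Node>> children;
};

// One compound selector such as "div", "#main", ".price" or "span.price".
struct SimpleSelector {
    std::string tag;                   // empty means "any tag"
    std::string id;                    // empty means "any id"
    std::vector<std::string> classes;  // every listed class must be present
};

// Parse an optional tag name followed by any number of #id / .class parts.
SimpleSelector parse_simple(const std::string& s) {
    SimpleSelector sel;
    std::size_t i = 0;
    auto read_name = [&]() {
        std::size_t start = i;
        while (i < s.size() && s[i] != '#' && s[i] != '.') ++i;
        return s.substr(start, i - start);
    };
    sel.tag = read_name();
    if (sel.tag == "*") sel.tag.clear();  // universal selector
    while (i < s.size()) {
        char kind = s[i++];
        std::string name = read_name();
        if (name.empty()) continue;       // ignore stray '#' or '.'
        if (kind == '#') sel.id = name;
        else sel.classes.push_back(name);
    }
    return sel;
}

bool matches(const Node& n, const SimpleSelector& sel) {
    if (!sel.tag.empty() && n.tag != sel.tag) return false;
    if (!sel.id.empty() && n.id != sel.id) return false;
    for (const auto& c : sel.classes)
        if (std::find(n.classes.begin(), n.classes.end(), c) == n.classes.end())
            return false;
    return true;
}

// Check the selectors left of the last one against the ancestor stack,
// walking from the nearest ancestor up to the root.
bool ancestors_match(const std::vector<const Node*>& ancestors,
                     const std::vector<SimpleSelector>& chain) {
    int need = static_cast<int>(chain.size()) - 2;
    for (auto it = ancestors.rbegin(); it != ancestors.rend() && need >= 0; ++it)
        if (matches(**it, chain[need])) --need;
    return need < 0;
}

void collect(const Node& n, const std::vector<SimpleSelector>& chain,
             std::vector<const Node*>& ancestors, std::vector<const Node*>& out) {
    if (matches(n, chain.back()) && ancestors_match(ancestors, chain))
        out.push_back(&n);
    ancestors.push_back(&n);
    for (const auto& child : n.children) collect(*child, chain, ancestors, out);
    ancestors.pop_back();
}

// Return all nodes matching a whitespace-separated descendant selector,
// in document order.
std::vector<const Node*> select_all(const Node& root, const std::string& selector) {
    std::vector<SimpleSelector> chain;
    std::istringstream in(selector);
    std::string part;
    while (in >> part) chain.push_back(parse_simple(part));
    std::vector<const Node*> out, ancestors;
    if (!chain.empty()) collect(root, chain, ancestors, out);
    return out;
}

// Convenience builder for hand-made trees (illustration only).
std::unique_ptr<Node> make_node(std::string tag, std::string id = "",
                                std::vector<std::string> classes = {},
                                std::string text = "") {
    std::unique_ptr<Node> n(new Node);
    n->tag = std::move(tag);
    n->id = std::move(id);
    n->classes = std::move(classes);
    n->text = std::move(text);
    return n;
}
```

With a tree in hand, a call such as `select_all(*root, "div.item span.price")` returns pointers to the matching nodes. Child (`>`), attribute and pseudo-class selectors are deliberately left out of this sketch.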

ollo
  • I did not even think of only a CSS Selector!!! [How fool of me!!! I get the source of the page by cURL and CSS selector will do the rest!!!]. Thanks for pointing that out. – Writwick Aug 07 '13 at 17:51
  • LibDOM is not C++ compatible as it uses the keyword namespace as a member variable of a structure. – Czipperz Jan 26 '17 at 06:58
9

If you are familiar with the Qt framework, the most convenient way is to use QWebElement (reference here).

Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then parsing it with an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.

huysentruitw
sgun
7

Chromium has an open source parser. Also, the Google gumbo-parser looks cool.

3

Yes, there is an HTML parser library for C++; check it out at https://github.com/HamedMasafi/HtmlParser/

This library can parse HTML or CSS and convert it to a tree model. You can search the parsed HTML with methods like get_by_id, get_by_class_name and get_by_tag_name, and there is also a query method that lets you search via a CSS selector (only tag, id, class and nested child selectors are supported for now).

After finding a child you can change its attributes, and finally you can print the HTML into a std::string in compact or pretty mode.

Hamed Masafi
  • Just linking to your own library or tutorial is not a good answer. Linking to it, explaining why it solves the problem, providing code on how to do so and disclaiming that you wrote it makes for a better answer. See: [**What signifies “Good” self promotion?**](//meta.stackexchange.com/q/182212) – Suraj Rao Feb 06 '19 at 10:34
  • OK, thanks for comment. – Hamed Masafi Feb 06 '19 at 14:23
1

You can use xerces2 as a DOM parser.

Or use HTML Tidy to clean up the HTML and convert it to XHTML, then parse the XML with pugixml or a similar XML parser. And since pugixml is a non-validating parser, it might even work on the raw HTML without the need to run HTML Tidy on it first.
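As a rough illustration of why the Tidy step usually matters: XML parsers generally expect every element to be closed and properly nested, and typical HTML (void elements such as `<br>`, omitted `</li>` tags) is not. The sketch below is a simplified balance check written for this post; `tags_balanced` is a hypothetical helper, not part of HTML Tidy or pugixml, and it ignores CDATA sections and `>` characters inside attribute values or comments.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns true when every opening tag is closed in proper nesting order,
// which is roughly what an XML parser requires of its input.
bool tags_balanced(const std::string& markup) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t i = 0;
    while ((i = markup.find('<', i)) != std::string::npos) {
        std::size_t end = markup.find('>', i);
        if (end == std::string::npos) return false;      // unterminated tag
        std::string body = markup.substr(i + 1, end - i - 1);
        i = end + 1;
        if (body.empty()) return false;                  // "<>" is not a tag
        if (body[0] == '!' || body[0] == '?') continue;  // comment, doctype, PI
        bool closing = body[0] == '/';
        bool self_closing = body.back() == '/';
        // The element name runs up to the first whitespace or '/'.
        std::size_t start = closing ? 1 : 0;
        std::size_t stop = start;
        while (stop < body.size() && body[stop] != '/' &&
               !std::isspace(static_cast<unsigned char>(body[stop])))
            ++stop;
        std::string name = body.substr(start, stop - start);
        if (closing) {
            if (open.empty() || open.back() != name) return false;
            open.pop_back();
        } else if (!self_closing) {
            open.push_back(name);
        }
    }
    return open.empty();
}
```

Under these assumptions, XHTML-style input like `<p>Hello<br/>world</p>` passes, while the equally common HTML form `<p>Hello<br>world</p>` does not, which is the kind of input Tidy is meant to repair before an XML parser sees it.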

huysentruitw
1

If you don't mind calling out to Python from C++, you could use Beautiful Soup. At least the name is right!

Seriously, it's a nice, no-nonsense HTML parser. I haven't tried calling out to it from C++, although it should be straightforward.

Graham Griffiths