Jsoup-like HTML parser for C++

Question

I have been writing Java code to extract data from web pages, and Jsoup was one of the best libraries to work with. Unfortunately, I have to port the whole code to C/C++, and I cannot find a decent HTML parser for C++. Is there a Jsoup-like library for C++, or how can similar results be achieved?

[Currently I am using cURL to fetch the page source and searching the internet for an HTML parser.]

There are [really good XML parsers](http://stackoverflow.com/questions/170686) out there, but I am not aware of a good C++ HTML specific parser — nikolas, Jul 29 '13 at 10:46
I would rather not use JNI; I don't know much about it. I also want to keep the project's dependencies to the necessary minimum. — Writwick, Jul 29 '13 at 12:20
To clarify, all I need is to parse the document and extract some values from it, with a reliable way to query it using CSS selectors [preferably] or XPath. It would also help if the parser is very fast, since I will be crawling over 100,000 pages to maintain a database. — Writwick, Jul 29 '13 at 12:24
I don't know how things compare to Jsoup, but see [Comparison of HTML parsers](http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers) for a detailed list of parsers in various languages. — jww, Mar 11 '14 at 06:26
Google open sourced Gumbo: https://github.com/google/gumbo-parser — CC., Jun 05 '14 at 22:01

ollo · Accepted Answer · 2013-08-07T13:57:09.747

Unfortunately, I guess there's no parser like Jsoup for C++ ...

Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries

I used TinyXML-2 for (HTML) DOM parsing; it's a very small library (only 2 files) that runs on most operating systems (even non-desktop ones).

LibXml

push and pull parser (DOM, SAX)
Validation
XPath and XPointer support
Cross-platform / good documentation

Apache Xerces

push and pull parser (DOM, SAX)
Validation
No XPath support (but a package for this?)
Cross-platform / good documentation

If you are on C++/CLI, check out NSoup, a Jsoup port for .NET.

Some more:

htmlcxx - HTML and CSS APIs for C++
MSHTML (?)
pugixml (DOM / XPath and Unicode support)
LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
hcxselect (CSS selector engine for C++)

Maybe you can combine a DOM Model / Parser and a CSS selector together?
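
To illustrate the idea of pairing a DOM tree with a selector engine, here is a minimal sketch in standard C++ only. `Node`, `build_sample` and `query_all` are hypothetical names made up for this example; they are not the API of any library listed above. The sketch assumes the tree has already been built by some parser, and it handles only tag, `#id`, `.class` and the descendant combinator (whitespace).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real parser would supply its own tree type.
struct Node {
    std::string tag, id, text;
    std::vector<std::string> classes;
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;

    // Appends a child element and returns a non-owning pointer to it.
    Node* add(std::string t, std::string i = "",
              std::vector<std::string> c = {}, std::string txt = "") {
        auto child = std::make_unique<Node>();
        child->tag = std::move(t);
        child->id = std::move(i);
        child->classes = std::move(c);
        child->text = std::move(txt);
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// One compound selector such as "div", "#main", ".price" or "span.price".
struct Compound {
    std::string tag, id;
    std::vector<std::string> classes;
};

Compound parse_compound(const std::string& s) {
    Compound c;
    std::string* target = &c.tag;  // characters go to the tag until '#' or '.'
    for (char ch : s) {
        if (ch == '#') {
            target = &c.id;
        } else if (ch == '.') {
            c.classes.emplace_back();
            target = &c.classes.back();
        } else {
            target->push_back(ch);
        }
    }
    return c;
}

bool matches(const Node& n, const Compound& c) {
    if (!c.tag.empty() && c.tag != n.tag) return false;
    if (!c.id.empty() && c.id != n.id) return false;
    for (const auto& cls : c.classes)
        if (std::find(n.classes.begin(), n.classes.end(), cls) == n.classes.end())
            return false;
    return true;
}

// True if n matches the last compound and the earlier compounds match
// ancestors of n in order (descendant combinator only).
bool matches_chain(const Node& n, const std::vector<Compound>& chain) {
    assert(!chain.empty());
    if (!matches(n, chain.back())) return false;
    const Node* cur = n.parent;
    for (std::size_t i = chain.size() - 1; i-- > 0;) {
        while (cur && !matches(*cur, chain[i])) cur = cur->parent;
        if (!cur) return false;
        cur = cur->parent;
    }
    return true;
}

void collect(const Node& n, const std::vector<Compound>& chain,
             std::vector<const Node*>& out) {
    if (matches_chain(n, chain)) out.push_back(&n);
    for (const auto& child : n.children) collect(*child, chain, out);
}

// Returns matching nodes in document (pre-order) order.
std::vector<const Node*> query_all(const Node& root, const std::string& selector) {
    std::vector<Compound> chain;
    std::istringstream in(selector);
    for (std::string part; in >> part;) chain.push_back(parse_compound(part));
    std::vector<const Node*> out;
    if (!chain.empty()) collect(root, chain, out);
    return out;
}

// Hand-built stand-in for a parsed page (sample data, not real output).
std::unique_ptr<Node> build_sample() {
    auto html = std::make_unique<Node>();
    html->tag = "html";
    Node* body = html->add("body");
    Node* main_div = body->add("div", "main");
    main_div->add("span", "", {"price"}, "10");
    main_div->add("p")->add("span", "", {"price", "sale"}, "8");
    body->add("div", "footer")->add("span", "", {"price"}, "0");
    return html;
}
```

With `auto doc = build_sample();` kept alive, `query_all(*doc, "#main span.price")` returns the two price spans under `div#main` and skips the one in the footer. A real selector engine also has to cover attribute selectors, child and sibling combinators and pseudo-classes, which is why pairing an existing parser with an existing selector library is the more practical route.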

I did not even think of using only a CSS selector!!! [How foolish of me!!! I get the page source with cURL and a CSS selector will do the rest!!!] Thanks for pointing that out. — Writwick, Aug 07 '13 at 17:51
LibDOM is not C++ compatible as it uses the keyword namespace as a member variable of a structure. — Czipperz, Jan 26 '17 at 06:58

score 9 · Answer 2 · edited Aug 07 '13 at 11:54

If you are familiar with the Qt framework, the most convenient way is to use QWebElement (reference here).

Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then parsing it with an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.

answered Aug 04 '13 at 20:09 by sgun · edited Aug 07 '13 at 11:54 by huysentruitw

+1 for mentioning QWebElement. Didn't even know it exists. – huysentruitw Aug 07 '13 at 11:55
I did not know that either. :D +1. It's really the simplest solution, but the dependency on Qt is a problem [for me]... – Writwick Aug 07 '13 at 17:47

score 7 · Answer 3 · 2014-12-27T04:41:02.510

Chromium has an open source parser. Also, the Google gumbo-parser looks cool.

edited Dec 27 '14 at 04:41

answered Dec 27 '14 at 00:24

Much appreciated. Let's hope it floats up to the top voted answers – sehe Mar 06 '16 at 03:37

Hamed Masafi · Answer 4 · 2019-02-06T14:27:50.400

3

Yes, there is an HTML parser library for C++; check it out: https://github.com/HamedMasafi/HtmlParser/

This library can parse HTML or CSS and convert it to a tree model. You can search the parsed HTML with methods like get_by_id, get_by_class_name and get_by_tag_name, and there is also a query method that searches via CSS selector (only tag, id, class and nested child selectors are supported for now).

After finding a child you can change its attributes, and finally you can print the HTML into a std::string in compact or pretty mode.

edited Feb 06 '19 at 14:27

answered Feb 06 '19 at 09:53


Just linking to your own library or tutorial is not a good answer. Linking to it, explaining why it solves the problem, providing code on how to do so and disclaiming that you wrote it makes for a better answer. See: [**What signifies “Good” self promotion?**](//meta.stackexchange.com/q/182212) – Suraj Rao Feb 06 '19 at 10:34
OK, thanks for the comment. – Hamed Masafi Feb 06 '19 at 14:23

score 1 · Answer 5 · answered Jul 31 '13 at 18:41

You can use xerces2 as a DOM parser.

Or use HTML Tidy to clean up the HTML and convert it to XHTML, then parse the XML with pugixml or a similar XML parser. And since pugixml is a non-validating parser, it might also work on the raw HTML without the need to run HTML Tidy on it first.

score 1 · Answer 6 · answered Aug 07 '13 at 15:36

If you don't mind calling out to Python from C++, you could use Beautiful Soup. At least the name is right!

Seriously, it's a nice, no-nonsense HTML parser. I haven't tried calling out to it from C++, although it should be straightforward.

answered Aug 07 '13 at 15:36 by Graham Griffiths
