
So I want to create a web crawler in C. There are hardly any libraries to support this.
I can use libtidy to convert HTML to XHTML and get the HTML files using libcurl (which has decent documentation).
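
For the fetching side, a minimal libcurl sketch along these lines is roughly what I have in mind; the URL is just a placeholder, error handling is kept to a minimum, and the libtidy conversion step is left out:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Growable buffer that accumulates the response body. */
struct page {
    char  *data;
    size_t size;
};

/* libcurl calls this for each chunk of the body it receives. */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct page *p = userdata;
    size_t chunk = size * nmemb;

    char *grown = realloc(p->data, p->size + chunk + 1);
    if (!grown)
        return 0;                     /* short write tells libcurl to abort */
    p->data = grown;

    memcpy(p->data + p->size, ptr, chunk);
    p->size += chunk;
    p->data[p->size] = '\0';
    return chunk;
}

int main(void)
{
    struct page page = { NULL, 0 };

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &page);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        printf("fetched %zu bytes\n", page.size);
    else
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    free(page.data);
    return 0;
}
```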

My problem is parsing the HTML files and getting all the links present in them. I know libxml2 exists, but it's extremely hard to understand because there is no good documentation for its API.

Should I even do this in C, or go with another language like Java? Or are there any good alternatives to libxml2?

Deepankar Bajpeyi

1 Answer


At its core, parsing HTML is just string manipulation.

But doing it robustly is quite hard without a proper HTML parser (or an XML parser, if the input is XHTML).
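
If you do stay in C, though, getting the links out of a page with libxml2's HTML parser is not a huge amount of code. Here is a rough sketch, assuming the document is already in a memory buffer; the `print_links` name and the XPath-based approach are just my own example, not the only way to do it:

```c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

/* Print the href of every <a> element in an HTML document held in memory. */
static void print_links(const char *html, size_t len, const char *base_url)
{
    /* The HTML parser recovers from broken markup instead of failing outright. */
    htmlDocPtr doc = htmlReadMemory(html, (int)len, base_url, NULL,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (!doc)
        return;

    /* Select all href attributes of anchor elements with one XPath query. */
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result =
        xmlXPathEvalExpression((const xmlChar *)"//a/@href", ctx);

    if (result && result->nodesetval) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *href = xmlNodeGetContent(result->nodesetval->nodeTab[i]);
            printf("%s\n", (const char *)href);
            xmlFree(href);
        }
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
}
```

The HTML parser is forgiving about malformed markup, so you may not even need the libtidy pass; on most systems this compiles with the flags reported by `xml2-config --cflags --libs`.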

As for the second part of the question, I wouldn't choose C for such a task, because even basic string operations are much more complex in C than in the many other languages that support them natively.

I would go for a scripting language such as Python, JavaScript, or PHP.

Instead of using libcurl, you can invoke curl as a command-line tool.

Btw: libcurl documentation is very good (in my opinion).

Paolo
  • Yes, the libcurl documentation is really good. I don't want to invoke curl via system() because it will spawn another shell process, which I do not want. I will look into regex :) Thank you – Deepankar Bajpeyi Jan 19 '13 at 17:15
  • You cannot parse HTML with regex. Regular expressions are not powerful enough to handle HTML, which is a context-free grammar. See [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and [this](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/). – antimatter Apr 08 '14 at 09:31
  • 1
    Changing your answer to remove your original suggestion that you should use regex is all well, but downvoting half a dozen of my answers because you're upset shows that you're dishonest and you lack integrity. – antimatter Apr 10 '14 at 00:21