
So I want to create a web crawler in C. There are hardly any libraries to support this.
I can use libtidy to convert HTML to XHTML and get the HTML files using libcurl (which has decent documentation).
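
For the fetching side, a minimal libcurl sketch along these lines is roughly what I have in mind; the URL is just a placeholder, error handling is kept to a minimum, and the libtidy conversion step is left out:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Growable buffer that accumulates the response body. */
struct page {
    char  *data;
    size_t size;
};

/* libcurl calls this for each chunk of the body it receives. */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct page *p = userdata;
    size_t chunk = size * nmemb;

    char *grown = realloc(p->data, p->size + chunk + 1);
    if (!grown)
        return 0;                     /* short write tells libcurl to abort */
    p->data = grown;

    memcpy(p->data + p->size, ptr, chunk);
    p->size += chunk;
    p->data[p->size] = '\0';
    return chunk;
}

int main(void)
{
    struct page page = { NULL, 0 };

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &page);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        printf("fetched %zu bytes\n", page.size);
    else
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    free(page.data);
    return 0;
}
```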

My problem is parsing the HTML files and getting all the links present in them. I know libxml2 exists, but it's extremely hard to understand because there is no good documentation for its API.

Should I even do this in C, or go with another language like Java? Or are there any good alternatives to libxml2?

Deepankar Bajpeyi

1 Answer


At its core, parsing HTML is just string manipulation.

But doing it robustly is quite hard without a proper HTML parser (or an XML parser, if the input is XHTML).
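
If you do stay in C, though, getting the links out of a page with libxml2's HTML parser is not a huge amount of code. Here is a rough sketch, assuming the document is already in a memory buffer; the `print_links` name and the XPath-based approach are just my own example, not the only way to do it:

```c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

/* Print the href of every <a> element in an HTML document held in memory. */
static void print_links(const char *html, size_t len, const char *base_url)
{
    /* The HTML parser recovers from broken markup instead of failing outright. */
    htmlDocPtr doc = htmlReadMemory(html, (int)len, base_url, NULL,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (!doc)
        return;

    /* Select all href attributes of anchor elements with one XPath query. */
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result =
        xmlXPathEvalExpression((const xmlChar *)"//a/@href", ctx);

    if (result && result->nodesetval) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *href = xmlNodeGetContent(result->nodesetval->nodeTab[i]);
            printf("%s\n", (const char *)href);
            xmlFree(href);
        }
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
}
```

The HTML parser is forgiving about malformed markup, so you may not even need the libtidy pass; on most systems this compiles with the flags reported by `xml2-config --cflags --libs`.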

As for the second part of the question, I wouldn't choose C for such a task, because even basic string operations are much more complex in C than in the many other languages that support them natively.

I would go for a scripting language such as Python, JavaScript, or PHP.

Instead of using libcurl, you can invoke curl as a command-line tool.

Btw: libcurl documentation is very good (in my opinion).

Paolo
  • Yes, the libcurl documentation is really good. I don't want to invoke curl via system() because it will spawn another shell process, which I do not want. I will look into regex :) Thank you – Deepankar Bajpeyi Jan 19 '13 at 17:15
  • You cannot parse HTML with regex. Regular expressions are not powerful enough to handle HTML, which is a context-free grammar. See [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and [this](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/). – antimatter Apr 08 '14 at 09:31
  • 1
    Changing your answer to remove your original suggestion that you should use regex is all well, but downvoting half a dozen of my answers because you're upset shows that you're dishonest and you lack integrity. – antimatter Apr 10 '14 at 00:21