I'm trying to write a small program to extract some data from a web page using libxml2. Since the data is in a local HTML file, I decided to use the following as a starting point to get the HTML into a traversable in-memory structure:
int main(int argc, char* argv[])
{
    htmlDocPtr dp = htmlReadFile(argv[1], NULL, HTML_PARSE_RECOVER | HTML_PARSE_NONET);
However, when I run this passing the HTML file as a parameter, I get an error:
HTML parser error : htmlParseEntityRef: expecting ';'
What it seems to be complaining about is the following:
<a href="do_something.html?a=1&b=2"> some stuff </a>
i.e. rather than ignore the contents of the href attribute or treat it as a URL with parameters, it seems to be treating the bit from &b as an entity reference like &name; and complaining that there's no semicolon. Surely that's not right? Should I be doing something different to get it to ignore this (I'm not interested in these tags in any case) or have I just missed the point somehow?
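To make the complaint concrete, here is a minimal sketch in plain C of the kind of check that seems to be involved. It is not libxml2's code: find_unterminated_entity is a hypothetical helper written only for illustration, and it assumes an entity name is simply a run of ASCII letters and digits (optionally preceded by '#') that must be closed by ';'.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (NOT part of libxml2): scan a string such as an
 * attribute value and return the offset of the first '&' that begins a
 * name but is not terminated by ';'. Returns -1 if no such '&' exists.
 * Assumption: a "name" is a run of ASCII alphanumerics, optionally
 * preceded by '#' for numeric references.
 */
long find_unterminated_entity(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] != '&')
            continue;

        size_t j = i + 1;
        if (s[j] == '#')          /* numeric reference such as &#38; */
            j++;

        size_t name_start = j;
        while (isalnum((unsigned char)s[j]))
            j++;

        /* A name was read but no ';' follows it: looks like a broken entity. */
        if (j > name_start && s[j] != ';')
            return (long)i;
    }
    return -1;
}
```

Called on "do_something.html?a=1&b=2" this returns 21, the offset of the '&' in front of b. Called on "do_something.html?a=1&amp;b=2", where the ampersand is written in its HTML-escaped form, it returns -1.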