I'm trying to write a small program to extract some data from a web page using libxml2. Since the data is in a local HTML file, I decided to use the following as a starting point to get the HTML into a traversable in-memory structure:

int main(int argc, char* argv[])
{
    htmlDocPtr dp = htmlReadFile(argv[1], NULL, HTML_PARSE_RECOVER | HTML_PARSE_NONET );

However, when I run this passing the HTML file as a parameter, I get an error:

HTML parser error : htmlParseEntityRef: expecting ';'

What it seems to be complaining about is the following:

<a href="do_something.html?a=1&b=2"> some stuff </a>

That is, rather than ignoring the contents of the href attribute or treating it as a URL with parameters, the parser seems to treat the text from &b onwards as an entity reference like &name; and complains that there's no semicolon. Surely that's not right? Should I be doing something different to get it to ignore this (I'm not interested in these tags in any case), or have I just missed the point somehow?

Component 10

1 Answer

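Your input file is invalid, because the href attribute contains an unescaped ampersand. The URL itself is fine: section 2.2 of RFC 3986 (Reserved characters) reserves the question mark and the ampersand precisely so that they can act as delimiters in the query part, and percent-encoding them as %3F and %26 would change the meaning of the URL. Inside an HTML or XHTML attribute value, however, a literal ampersand has to be written as the entity &amp;, otherwise the parser reads &b as the start of an entity reference. A valid link would look like this:

<a href="do_something.html?a=1&amp;b=2"> some stuff </a>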

But this is only one of a long list of traps when parsing HTML. The usual approach is to run the input through a tidying library first (see the question "Parse html using C"), so that errors in the HTML are cleaned up before the actual parsing.
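For the specific error in the question, the pre-cleaning step can be sketched by hand. The helper below is a minimal illustration, not part of libxml2 or of any tidying library: the names looks_like_reference and escape_bare_ampersands are hypothetical, and the code assumes the whole document fits in one NUL-terminated buffer. It rewrites every ampersand that does not begin a complete &name;, &#NN; or &#xHH; reference as &amp;.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s (which points at '&') starts a complete reference such as
 * "&amp;", "&#38;" or "&#x26;", i.e. a name or number terminated by ';'. */
static int looks_like_reference(const char *s)
{
    const char *p = s + 1;

    if (*p == '#') {
        p++;
        if (*p == 'x' || *p == 'X') {
            p++;
            if (!isxdigit((unsigned char)*p)) return 0;
            while (isxdigit((unsigned char)*p)) p++;
        } else {
            if (!isdigit((unsigned char)*p)) return 0;
            while (isdigit((unsigned char)*p)) p++;
        }
    } else {
        if (!isalpha((unsigned char)*p)) return 0;
        while (isalnum((unsigned char)*p)) p++;
    }
    return *p == ';';
}

/* Hypothetical helper: returns a malloc'd copy of `in` in which every bare
 * '&' is replaced by "&amp;". The caller frees the result. Returns NULL if
 * the allocation fails. */
static char *escape_bare_ampersands(const char *in)
{
    assert(in != NULL);

    /* Worst case: every input byte is '&' and grows to five bytes. */
    char *out = malloc(strlen(in) * 5 + 1);
    char *w = out;

    if (out == NULL) return NULL;

    for (const char *r = in; *r != '\0'; r++) {
        if (*r == '&' && !looks_like_reference(r)) {
            memcpy(w, "&amp;", 5);
            w += 5;
        } else {
            *w++ = *r;
        }
    }
    *w = '\0';
    return out;
}
```

Calling escape_bare_ampersands on the link from the question yields href="do_something.html?a=1&amp;b=2", and the cleaned buffer could then be handed to htmlReadMemory instead of reading the file directly. The sketch knows nothing about document structure, so it would also rewrite ampersands inside script blocks or comments; a real tidying library handles those cases and many other kinds of error.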

Jarekczek
  • Yes, you're right. This isn't the first error in this (supposedly XHTML-conformant) page. Thanks very much for the links - I'll follow them up. – Component 10 Nov 11 '12 at 23:36