
I'm trying to get a clean representation of a website (fetched from its URL), so I can put its HTML inside a

org.w3c.dom.Document

to be able to do further processing with XPath and so on.

What I get, when I try to put the html inside a document is :

org.xml.sax.SAXParseException: The element type "link" must be terminated by the matching end tag "</link>"

which means that every "link" element has to be closed, which isn't the case on this website.
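
What I'm doing is something like the standard JAXP parse (a simplified sketch, with a placeholder URL; the exact code doesn't matter much):

    import java.net.URL;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;

    public class HtmlToDom {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Fails with the SAXParseException above, because the page's HTML
            // is not well-formed XML (e.g. <link> elements are never closed).
            Document doc = builder.parse(new URL("http://example.com/").openStream());
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }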

So, what could be the right approach? Should I 'fix' the document and repair the errors myself?

I tried net.sourceforge.htmlcleaner, but I couldn't figure out how to 'fix' the errors with it.

Any help?

Regards, Holger

ITR
    It depends what the HTML cleaner does to the HTML. Valid HTML is not necessarily valid XML - http://stackoverflow.com/questions/10473875/converting-html-to-xml. – Paul Grime Apr 11 '13 at 09:11

2 Answers


You can have a look at NekoHTML: http://nekohtml.sourceforge.net/

It works very well for me.
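
For example, something like this (a minimal sketch; the URL is a placeholder, and the property that lower-cases element names is optional but handy for XPath):

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NekoHtmlExample {
        public static void main(String[] args) throws Exception {
            DOMParser parser = new DOMParser();
            // Optional: report element names in lower case, which is nicer for XPath
            parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
            // Neko balances the tags while parsing, so unclosed <link> elements are no problem
            parser.parse(new InputSource("http://example.com/"));
            Document doc = parser.getDocument();
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }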

Guillaume Serre

HTML is usually not valid XML, so a standard XML parser cannot load it into a Document. You need a dedicated HTML parser library like jsoup.
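
For example (a rough sketch; the URL is a placeholder, and the W3CDom helper, which is only available in more recent jsoup releases, converts the result to an org.w3c.dom.Document if you still want XPath):

    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // jsoup tolerates real-world HTML, including unclosed <link> tags
            org.jsoup.nodes.Document jsoupDoc = Jsoup.connect("http://example.com/").get();

            // Convert to a W3C DOM if XPath processing is still required
            org.w3c.dom.Document w3cDoc = new W3CDom().fromJsoup(jsoupDoc);
            System.out.println(w3cDoc.getDocumentElement().getNodeName());
        }
    }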

Denis Tulskiy