
I'm trying to get a clean representation of a website (fetched from its URL), so I can put its HTML inside a

org.w3c.dom.Document

to be able to do further processing with XPath and so on.

What I get, when I try to put the html inside a document is :

org.xml.sax.SAXParseException: The element type "link" must be terminated by the matching end tag "</link>"

which means that every "link" element has to be closed, which isn't the case on this website.
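
What I'm doing is something like the standard JAXP parse (a simplified sketch, with a placeholder URL; the exact code doesn't matter much):

    import java.net.URL;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;

    public class HtmlToDom {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Fails with the SAXParseException above, because the page's HTML
            // is not well-formed XML (e.g. <link> elements are never closed).
            Document doc = builder.parse(new URL("http://example.com/").openStream());
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }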

So, what could be the right approach? Should I 'fix' the document and repair the errors myself?

I tried net.sourceforge.htmlcleaner, but I couldn't figure out how to 'fix' the errors with it.

Any help?

Regards, Holger

ITR
    It depends what the HTML cleaner does to the HTML. Valid HTML is not necessarily valid XML - http://stackoverflow.com/questions/10473875/converting-html-to-xml. – Paul Grime Apr 11 '13 at 09:11

2 Answers


You can have a look at NekoHTML: http://nekohtml.sourceforge.net/

It works very well for me.
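
For example, something like this (a minimal sketch; the URL is a placeholder, and the property that lower-cases element names is optional but handy for XPath):

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NekoHtmlExample {
        public static void main(String[] args) throws Exception {
            DOMParser parser = new DOMParser();
            // Optional: report element names in lower case, which is nicer for XPath
            parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
            // Neko balances the tags while parsing, so unclosed <link> elements are no problem
            parser.parse(new InputSource("http://example.com/"));
            Document doc = parser.getDocument();
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }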

Guillaume Serre

HTML is usually not valid XML, so a standard XML parser cannot load it into a Document. You need a dedicated HTML parser library like jsoup.
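
For example (a rough sketch; the URL is a placeholder, and the W3CDom helper, which is only available in more recent jsoup releases, converts the result to an org.w3c.dom.Document if you still want XPath):

    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // jsoup tolerates real-world HTML, including unclosed <link> tags
            org.jsoup.nodes.Document jsoupDoc = Jsoup.connect("http://example.com/").get();

            // Convert to a W3C DOM if XPath processing is still required
            org.w3c.dom.Document w3cDoc = new W3CDom().fromJsoup(jsoupDoc);
            System.out.println(w3cDoc.getDocumentElement().getNodeName());
        }
    }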

Denis Tulskiy