Clean and convert HTML to XML for BaseX

Question

I would like to run XQuery commands with BaseX over an HTML source. The source may be full of <script> and <style> nodes that must be removed, and it may contain unclosed tags (<br>, <img>) that need to be closed (for example, the dirty source of this page).

"Converting HTML to XML" suggests using Tidy, but it doesn't have a GUI and doesn't seem to work correctly on my source (it outputs nothing). I also doubt that it removes scripts and other unnecessary tags, and it is very old.

As I didn't find any question that addresses my needs, I am asking again. Because the topic is closely related to tools for coding and querying, I asked it here.

To the close voters: I don't see how this question asks for product recommendations or requires any code to reproduce the issue. — Jens Erat, Jun 14 '15 at 21:20

Jens Erat · Accepted Answer · 2015-06-14T21:18:09.630

BaseX has integration for TagSoup, which will convert HTML to well-formed XHTML.

Most distributions of BaseX already bundle TagSoup. If you installed BaseX from a Linux repository, you might need to add it manually (for example, on Debian and Ubuntu the package is called libtagsoup-java). Further details for the different installation options are given in the documentation linked above.

Afterwards, either set the TagSoup parser as default using the command

SET PARSER html

or in the XQuery prologue using

declare option db:parser "html";

Afterwards, simply fetch the document you want. An example for the Amazon site you linked:

declare option db:parser "html";
doc('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&amp;field-keywords=camera')

This should work, but doesn't. I'm asking the main developers why (it seems to be caused by an HTTP redirection) and will update the answer when the issue is resolved or when I understand why it does not work. Until then, the workaround is to fetch the document as text and parse it as HTML:

html:parse(fetch:text('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&amp;field-keywords=camera'))
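
Regarding the <script> and <style> part of the question: TagSoup only makes the markup well-formed; it does not strip those elements. A minimal sketch of how they could be removed afterwards with XQuery Update follows. The inline sample string is made up for illustration, html:parse needs TagSoup on the classpath, and the name test uses local-name() so that it matches whether or not the parsed elements end up in the XHTML namespace (an assumption made to stay independent of the TagSoup options in use):

```xquery
(: Hypothetical sample input: unclosed <br> and <p>, plus script/style nodes. :)
let $html :=
  "<html><head><style>p{}</style><script>alert(1)</script></head>" ||
  "<body><p>Hi<br>there</body></html>"

(: Convert the HTML string to a well-formed XML document via TagSoup. :)
let $doc := html:parse($html)

return
  (: Work on a copy, so the parsed document itself is left untouched. :)
  copy $clean := $doc
  modify delete node $clean//*[local-name() = ('script', 'style')]
  return $clean
```

For a remote page, the sample string can be replaced with a fetch:text(...) call on the page URL, as in the workaround above.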

I think the problem is due to the fetch being blocked by Amazon: $ curl -I 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=camera' -- returns --> HTTP/1.1 405 Method Not Allowed ; maybe it will work with the right User-Agent — Steven D. Majewski, Jun 14 '15 at 22:20
I'm getting valid results using `curl -L`. They might have blocked you for excessive queries. Maybe you should consider using their API anyway; this should always be preferred over site scraping. — Jens Erat, Jun 14 '15 at 22:25
Thank you very much! I am going to install it for BaseX on Windows. — Ahmad, Jun 15 '15 at 05:14
