Can anyone recommend a Java library that lets me run an XPath query over an HTML page?

I tried using JAXP but it keeps giving me a strange error that I cannot seem to fix (thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Thank you very much.

EDIT

I found this:

// Create a new SAX Parser factory
SAXParserFactory saxFactory = SAXParserFactory.newInstance();

// Turn on validation
saxFactory.setValidating(true);

// Create a validating SAX parser instance
SAXParser parser = saxFactory.newSAXParser();

// Create a new DOM Document Builder factory
// (named differently from the SAX factory so both snippets compile together)
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();

// Turn on validation
domFactory.setValidating(true);

// Create a validating DOM parser
DocumentBuilder builder = domFactory.newDocumentBuilder();

from http://www.ibm.com/developerworks/xml/library/x-jaxpval.html, but changing the argument to false did not change anything.

Leonardo Marques
  • Several related questions - see http://stackoverflow.com/questions/9766776/extract-content-using-xpath-from-an-html-doc-using-pure-java http://stackoverflow.com/questions/3361263/library-to-query-html-with-xpath-in-java http://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java – Mark Butler Jan 07 '13 at 00:39

2 Answers

Setting the parser to "non-validating" only turns off validation; it does not stop the parser from fetching DTDs. As far as I recall, the DTD is needed not just for validation but also for entity expansion.

If you want to suppress fetching of DTDs, you need to register an EntityResolver on the DocumentBuilder via setEntityResolver (the DocumentBuilderFactory itself has no such setter). Implement the resolver's resolveEntity method to always return an InputSource that wraps an empty string.
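A minimal sketch of that resolver approach, assuming the page is well-formed XHTML and uses no DTD-defined entities such as &nbsp; (with an empty DTD those would be undefined and the parse would fail). The class name NoDtdXPath, the helper titleOf, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoDtdXPath {

    // Hypothetical helper: parses an XHTML string without fetching its DTD
    // and returns the text of /html/head/title.
    static String titleOf(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);

        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request (including the DTD) with an
        // empty document, so nothing is downloaded from www.w3.org.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/html/head/title", doc);
    }

    public static void main(String[] args) throws Exception {
        // Sample input, invented for this sketch.
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
              + "<html><head><title>Hello</title></head>"
              + "<body><p>Hi</p></body></html>";
        System.out.println(titleOf(xhtml));
    }
}
```

The factory here is left non-namespace-aware (the default), which is why the XPath expression can use plain element names.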

Isaac

Take a look at this:

http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

You probably have the parser set to perform DTD validation, so it is trying to retrieve the DTD. JAXP should have a way to disable DTD validation and just run XPath against a document assumed to be valid. I haven't used JAXP in many years, so I'm sorry I can't be more helpful.
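One way this can be done, as a sketch: the Xerces-based parser bundled with the JDK recognizes a load-external-dtd feature that can be switched off on the factory. This is a Xerces-specific feature URI, not part of the JAXP standard, so other parser implementations may reject it with a ParserConfigurationException. The class name SkipExternalDtd and the helper firstParagraph are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SkipExternalDtd {

    // Xerces-specific feature; assumed to be supported by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical helper: returns the text of the first <p> element.
    static String firstParagraph(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Tell a non-validating parser not to load the external DTD at all.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);

        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        return XPathFactory.newInstance().newXPath().evaluate("//p", doc);
    }

    public static void main(String[] args) throws Exception {
        // Sample input, invented for this sketch.
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
              + "<html><head><title>Hello</title></head>"
              + "<body><p>Hi</p></body></html>";
        System.out.println(firstParagraph(xhtml));
    }
}
```

As with any approach that skips the DTD, entities that the DTD would have defined (for example &nbsp;) are no longer known to the parser.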

Java Drinker