
I am currently learning how to parse XML and HTML. I was able to parse the Slickdeals XML feed with my current code, but when I try to parse the Slickdeals front page I get this error:

    [Fatal Error] :102:23: The entity name must immediately follow the '&' in the entity reference.
    Exception in thread "main" org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)

import java.net.URL;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SlickDealMainPage {

    public void parsing() throws Exception {
        String url = "http://slickdeals.net/";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        // The SAXParseException is thrown here: the page is HTML, not well-formed XML.
        Document doc = db.parse(new URL(url).openStream());
        doc.getDocumentElement().normalize();

        //System.out.println("Root Element : " + doc.getDocumentElement().getNodeName());

        // Prints the NodeList object itself, not the element's name or content.
        System.out.println("Root Element : " + doc.getElementsByTagName("Body"));

        NodeList itemList = doc.getElementsByTagName("body");

        // Loop left over from the XML feed version (title, link and pubDate are feed tags).
        /* for (int temp = 0; temp < itemList.getLength(); temp++)
        {
            Node itemNode = itemList.item(temp);

            System.out.println("\nCurrent Element : " + itemNode.getNodeName());

            Element itemElement = (Element) itemNode;

            System.out.println("\ntitle : " + itemElement.getElementsByTagName("title").item(0).getTextContent());
            System.out.println("\nLink : " + itemElement.getElementsByTagName("link").item(0).getTextContent());
            System.out.println("\nDate Published: " + itemElement.getElementsByTagName("pubDate").item(0).getTextContent());
        }*/
    }

}

I am new to using the DOM method for parsing, and I have searched all over for an answer to this problem. However, I did not really understand the other answers very well.

Edit: The error occurs at

    Document doc = db.parse(new URL(url).openStream());

Thank you for your help!

meepin
  • So what's at the lines the exception is specifying? Have you followed what the exception said was wrong? – BLaZuRE Jul 27 '13 at 22:18
  • Yes, it told me the exception occurs at "Document doc = db.parse(new URL(url).openStream());" and that it is a parsing error that has to do with the '&'. – meepin Jul 27 '13 at 22:26
  • You cannot really parse HTML using DOM. HTML on the web is usually non-compliant, so the parser will explode. You need to use something like [jSoup](http://jsoup.org/), which is a lenient HTML parser: it will try to correct issues as it parses. – Boris the Spider Jul 27 '13 at 22:32
  • If you actually try and go to the lines referenced in the error (stated column 102, line 23), at line 24, column 102, you will see there is an ampersand symbol (&). HTML is not a subset of XML. – BLaZuRE Jul 27 '13 at 22:41
  • Ahh, OK. I was kind of thinking the same thing. I thought XHTML was a combination of both XML and HTML? http://www.cs.nmsu.edu/~epontell/courses/XML/material/xmlparsers.html#q14 From the page source I can see that slickdeals is using XHTML, right? – meepin Jul 27 '13 at 22:43
  • @user2592708 - Wrong. It may be including an XHTML doctype and trying in some places to follow XHTML syntax rules, but it's really HTML. You should read [this](http://hixie.ch/advocacy/xhtml) and/or [this](http://stackoverflow.com/questions/2662508/html-4-html-5-xhtml-mime-types-the-definitive-resource/2664082#2664082). – Alohci Jul 28 '13 at 00:17
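
The point made in the comments can be reproduced without touching the network. The sketch below is only an illustration, not the asker's code: the class name `AmpersandDemo`, the helper `tryParse`, and the two sample strings are made up for this example. It feeds two small snippets to the same `DocumentBuilder` API used in the question. A bare `&` followed by a space is not a valid entity reference in XML, so the strict parser should reject it, while the escaped form `&amp;` should parse.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.SAXParseException;

public class AmpersandDemo {

    // Hypothetical helper: returns null if the markup parses as XML,
    // otherwise a short description of where and why the parser gave up.
    static String tryParse(String markup) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            db.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Common in real-world HTML, but not well-formed XML: a bare ampersand.
        String bare = "<p>Deals & Coupons</p>";
        // The same text with the ampersand escaped as an entity reference.
        String escaped = "<p>Deals &amp; Coupons</p>";

        System.out.println("bare    -> " + tryParse(bare));
        System.out.println("escaped -> " + tryParse(escaped));
    }
}
```

Since a real page cannot be fixed up by hand like this, the practical route is the one suggested above: a lenient HTML parser such as jSoup instead of the XML `DocumentBuilder`.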

0 Answers