I am extracting data from an XML file in Java using the DOM package. I can retrieve the information I need, but parsing fails whenever the XML contains an &nbsp entity reference.

This is my feed.xml file:

<inventory>
    <item UnitID="1234" Record="0">
        <id>1234</id>
        <dealerid>455</dealerid>
        <stock_number>1600Xtreme</stock_number>

        <details>This is some additional details &nbsp about the 
        product</details>

        <make>Nvidia</make>                       
    </item>
    <item UnitID="7854" Record="1">
        <id>7854</id>
        <dealerid>587</dealerid>
        <stock_number>12TMAX5500</stock_number>

        <details>This is some additional details &nbsp about the 
        product</details>

        <make>Realtek</make> 
    </item>
</inventory>

As you can see in feed.xml, the details tag contains &nbsp, and whenever I run my Java code it throws an error.

However, if I remove that line, everything works fine. Removing it is not an option, since I'm not allowed to edit the XML in real life.

This is my Java code:

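import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Parse feed.xml into a DOM tree (parse() is the call that throws the error)
File fXmlFile = new File("feed.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();

// Check each <item> for a <make> that matches the search term
String search = "Nvidia";
NodeList nList = doc.getElementsByTagName("item");
for (int temp = 0; temp < nList.getLength(); temp++) {
    Element eElement2 = (Element) nList.item(temp);
    String make = eElement2.getElementsByTagName("make").item(0).getTextContent();
    if (make.equals(search)) {
        System.out.println("The condition on the IF is True");
    }
}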

This is the error I get when I run it:

[Fatal Error] feed.xml:150:504: The entity "nbsp" was referenced, but not declared.
org.xml.sax.SAXParseException; systemId: file:/C:/src/Test1/feedForTests.xml; lineNumber: 150; columnNumber: 504; The entity "nbsp" was referenced, but not declared.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:205)
    at Test1.ReadXMLFile2.main(ReadXMLFile2.java:58)

Just removing the &nbsp from the details tag makes the problem disappear.

I have gotten this far with my code, but I'm stuck and can't find a solution. I appreciate your help.

David G.
  • Plenty of options to solve your issue, look here: http://stackoverflow.com/questions/36026353/parsing-xml-file-containing-html-entities-in-java-without-changing-the-xml/36097922#36097922 – Ivan Pronin May 12 '17 at 06:17

1 Answer

Your "XML" isn't XML, because it contains an unresolved entity reference. (In fact it's not even a well-formed entity reference because it lacks the terminating semicolon.)

So you're in the position of a lot of SO users: you've been sent bad data. My advice is, send it back to where it came from and ask for your money back. Don't accept shoddy goods. The whole point of XML is to reduce costs by using a widely implemented standard, and if people send you stuff that isn't XML then you get none of those benefits.

You can mend it, of course, but there's no reason why you and I should bear the costs incurred because of a data provider who doesn't care about quality.
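If you do end up mending it yourself, one possible approach is to repair the text before it reaches the parser. The sketch below is only an illustration under stated assumptions: the class and method names (NbspRepair, repairNbsp, parseRepaired) are hypothetical, it assumes nbsp is the only undeclared entity in the feed, and it assumes the text does not appear inside a CDATA section (where the replacement would change literal content). It rewrites "&nbsp" (with or without the semicolon) as the numeric character reference "&#160;", which needs no declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NbspRepair {

    // Hypothetical helper: turn "&nbsp" or "&nbsp;" into the numeric
    // character reference for a no-break space, which XML accepts as-is.
    static String repairNbsp(String rawXml) {
        return rawXml.replaceAll("&nbsp;?", "&#160;");
    }

    // Repair the text, then hand it to the standard DOM parser.
    static Document parseRepaired(String rawXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(repairNbsp(rawXml))));
    }

    public static void main(String[] args) throws Exception {
        // Inline sample standing in for feed.xml. To use the real file, read it
        // into a String first (assumption: the file is UTF-8 encoded), e.g.
        // new String(Files.readAllBytes(Paths.get("feed.xml")), StandardCharsets.UTF_8)
        String raw = "<inventory><item><details>some details &nbsp about the product"
                + "</details><make>Nvidia</make></item></inventory>";
        Document doc = parseRepaired(raw);
        String make = doc.getElementsByTagName("make").item(0).getTextContent();
        System.out.println("make = " + make);
    }
}
```

This keeps the original file untouched and only changes the in-memory copy, but it is still a workaround for bad data, not a substitute for getting the feed fixed at the source.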

Michael Kay