Parsing XML file containing HTML entities in Java without changing the XML

Question

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &nbsp; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.

Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.

I'd like to use:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = dbf.newDocumentBuilder();
Document        doc    = parser.parse( stream );

I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

Here's a full example:

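import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Main {
    public static void main( String [] args ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = dbf.newDocumentBuilder();
        Document        doc    = parser.parse( new FileInputStream( "test.xml" ));
    }
}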

with test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Produces:

[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.

Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?

The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

Can you provide some data samples? Is it a mix of XML and HTML at the same time? — Eugene Lebedev, Mar 16 '16 at 05:32
@jtahlborn: that callback does not seem to get invoked; I set a breakpoint there and it never gets hit. — Johannes Ernst, Mar 16 '16 at 17:24
I have used the JSoup API to parse HTML files. It is open source and has various utility methods for parsing HTML. — Rahul, Mar 24 '16 at 15:32

applecrusher · Accepted Answer · 2016-03-23T17:57:42.810

11

I would use a library like Jsoup for this purpose; it can be downloaded from http://jsoup.org/download. I tested the following and it works, though I don't know whether it fits your case.

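import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class Main {
    public static void main(String[] args) {
        String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
                      "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
        // Parser.xmlParser() makes Jsoup treat the input as XML rather than HTML
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Print every <bar> element found in the document
        for (Element e : doc.select("bar")) {
            System.out.println(e);
        }
    }
}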

Result:

<bar>
 Some&nbsp;text — invalid!
</bar>

Loading from a file can be found here:

http://jsoup.org/cookbook/input/load-document-from-file

edited Mar 23 '16 at 17:57

answered Mar 23 '16 at 17:52

applecrusher


1

This answer is not entirely what I had in mind (use the JDK's DocumentBuilderFactory etc) but it seems the closest actually viable approach. So I'll mark this as the accepted answer and award the bounty. – Johannes Ernst Mar 25 '16 at 18:48

score 8 · Answer 2 · edited May 23 '17 at 12:02

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;

XML has only five predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;). &mdash; and &nbsp; are not among them; they work only in plain HTML or in legacy JSP. So SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)

Issue - 2: I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

Streaming API for XML, called StAX, is an API for reading and writing XML documents.

StAX is a pull-parsing model: the application takes control of parsing by pulling events from the parser.

The core StAX API falls into two categories:

Cursor-based API: the low-level API. It lets the application process XML as a stream of tokens (events).
Iterator-based API: the higher-level API. It lets the application process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.

The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:

Requires the parser to replace internal entity references with their replacement text and report them as characters

This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader.

However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to not replace them.

You can try it; I hope it solves your issue. For your case:

Main.java

import java.io.FileInputStream;
import java.io.FileNotFoundException;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

public class Main {

    public static void main(String[] args) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty(
                XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLEventReader reader;
        try {
            reader = inputFactory
                    .createXMLEventReader(new FileInputStream("F://test.xml"));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isEntityReference()) {
                    EntityReference ref = (EntityReference) event;
                    System.out.println("Entity Reference: " + ref.getName());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (XMLStreamException e) {
            e.printStackTrace();
        }
    }
}

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Output:

Entity Reference: nbsp

Entity Reference: mdash

Credit goes to @skaffman.
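
The same check can also be sketched with the lower-level cursor API (XMLStreamReader) mentioned above. This is only a minimal sketch: the class name CursorEntityLister and the helper entityNames are made up for illustration, and it assumes the StAX implementation honours IS_REPLACING_ENTITY_REFERENCES=false for undeclared entities, as the JDK's built-in one did in the test above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class CursorEntityLister {

    /** Collects the names of all entity references that the parser reports as events. */
    static List<String> entityNames(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to report entity references instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // For ENTITY_REFERENCE events, getLocalName() returns the entity name.
                if (reader.next() == XMLStreamConstants.ENTITY_REFERENCE) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<foo>\n    <bar>Some&nbsp;text &mdash; invalid!</bar>\n</foo>\n";
        System.out.println(entityNames(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

The cursor API avoids allocating one event object per token, at the cost of a less convenient programming model.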

UPDATE:

Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them with something else, for example) and still produce a Document at the end of the process?

To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.

Five XMLStreamWriter methods are enough to write a document:

xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
xmlsw.writeStartElement(String s) - creates a new element named s
xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. Attributes can be added only as long as no call to writeStartElement, writeCharacters or writeEndElement has been made since.
xmlsw.writeEndElement() - closes the last started element
xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element

A sample is attached:

StAXExpand.java

import  java.io.BufferedReader;
import  java.io.FileReader;
import  java.io.IOException;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import java.util.Arrays;

public class StAXExpand {   
    static XMLStreamWriter xmlsw = null;
    public static void main(String[] argv) {
        try {
            xmlsw = XMLOutputFactory.newInstance()
                          .createXMLStreamWriter(System.out);
            CompactTokenizer tok = new CompactTokenizer(
                          new FileReader(argv[0]));

            String rootName = "dummyRoot";
            // ignore everything preceding the word before the first "["
            while(!tok.nextToken().equals("[")){
                rootName=tok.getToken();
            }
            // start creating new document
            xmlsw.writeStartDocument();
            ignorableSpacing(0);
            xmlsw.writeStartElement(rootName);
            expand(tok,3);
            ignorableSpacing(0);
            xmlsw.writeEndDocument();

            xmlsw.flush();
            xmlsw.close();
        } catch (XMLStreamException e){
            System.out.println(e.getMessage());
        } catch (IOException ex) {
            System.out.println("IOException"+ex);
            ex.printStackTrace();
        }
    }

    public static void expand(CompactTokenizer tok, int indent) 
        throws IOException,XMLStreamException {
        tok.skip("["); 
        while(tok.getToken().equals("@")) {// add attributes
            String attName = tok.nextToken();
            tok.nextToken();
            xmlsw.writeAttribute(attName,tok.skip("["));
            tok.nextToken();
            tok.skip("]");
        }
        boolean lastWasElement=true; // for controlling the output of newlines 
        while(!tok.getToken().equals("]")){ // process content 
            String s = tok.getToken().trim();
            tok.nextToken();
            if(tok.getToken().equals("[")){
                if(lastWasElement)ignorableSpacing(indent);
                xmlsw.writeStartElement(s);
                expand(tok,indent+3);
                lastWasElement=true;
            } else {
                xmlsw.writeCharacters(s);
                lastWasElement=false;
            }
        }
        tok.skip("]");
        if(lastWasElement)ignorableSpacing(indent-3);
        xmlsw.writeEndElement();
   }

    private static char[] blanks = "\n".toCharArray();
    private static void ignorableSpacing(int nb) 
        throws XMLStreamException {
        if(nb>blanks.length){// extend the length of space array 
            blanks = new char[nb+1];
            blanks[0]='\n';
            Arrays.fill(blanks,1,blanks.length,' ');
        }
        xmlsw.writeCharacters(blanks, 0, nb+1);
    }

}

CompactTokenizer.java

import  java.io.Reader;
import  java.io.IOException;
import  java.io.StreamTokenizer;

public class CompactTokenizer {
    private StreamTokenizer st;

    CompactTokenizer(Reader r){
        st = new StreamTokenizer(r);
        st.resetSyntax(); // remove parsing of numbers...
        st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                         // except the following...
        st.ordinaryChar('\n');
        st.ordinaryChar('[');
        st.ordinaryChar(']');
        st.ordinaryChar('@');
    }

    public String nextToken() throws IOException{
        st.nextToken();
        while(st.ttype=='\n'|| 
              (st.ttype==StreamTokenizer.TT_WORD && 
               st.sval.trim().length()==0))
            st.nextToken();
        return getToken();
    }

    public String getToken(){
        return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
    }

    // Consumes the expected symbol and returns the token that follows it.
    public String skip(String sym) throws IOException {
        if(getToken().equals(sym))
            return nextToken();
        else
            throw new IllegalArgumentException("skip: "+sym+" expected but "+
                                               getToken() +" found");
    }
}

For more, you can follow the tutorial
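
For Issue 3 specifically (filtering the entities and still ending up with a Document), one possible sketch is to combine the XMLEventReader loop from Main.java with a DOM tree built by hand. Everything named here (StaxToDom, REPLACEMENTS, parse) is hypothetical and for illustration only. The sketch ignores namespaces, comments and processing instructions, only covers entity references in element content, and assumes Java 9+ (Map.of) plus a StAX implementation that reports undeclared entities as events when IS_REPLACING_ENTITY_REFERENCES is false.

```java
import java.io.InputStream;
import java.util.Iterator;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of any library.
class StaxToDom {

    // Deliberately tiny, made-up table; extend it with the entities your files use.
    private static final Map<String, String> REPLACEMENTS =
            Map.of("nbsp", "\u00A0", "mdash", "\u2014");

    /** Parses the stream in a single pass and returns a DOM Document. */
    static Document parse(InputStream in) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLEventReader reader = factory.createXMLEventReader(in);

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Node current = doc; // node that receives the next child

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                Element element = doc.createElement(start.getName().getLocalPart());
                for (Iterator<?> it = start.getAttributes(); it.hasNext();) {
                    Attribute attr = (Attribute) it.next();
                    element.setAttribute(attr.getName().getLocalPart(), attr.getValue());
                }
                current.appendChild(element);
                current = element;
            } else if (event.isEndElement()) {
                current = current.getParentNode();
            } else if (event.isCharacters()) {
                // Text outside the root element cannot be attached to the Document node.
                if (current != doc) {
                    current.appendChild(doc.createTextNode(event.asCharacters().getData()));
                }
            } else if (event.isEntityReference() && current != doc) {
                String name = ((EntityReference) event).getName();
                // Unknown entities are kept as literal text rather than dropped.
                String text = REPLACEMENTS.getOrDefault(name, "&" + name + ";");
                current.appendChild(doc.createTextNode(text));
            }
        }
        return doc;
    }
}
```

Code that already works on org.w3c.dom.Document could then call StaxToDom.parse(new FileInputStream("test.xml")) instead of DocumentBuilder.parse, and the XML is read only once.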

My existing code works on Document rather than events. Is there a way to use StaX to "filter" the entities (replacing them with something else, for example) and still produce a Document at the end of the process, so I don't have to redo all my code? (and preferably without parsing XML twice) — Johannes Ernst, Mar 20 '16 at 04:36
@JohannesErnst StAX provides a filter interface that allows programmers to hide unnecessary document detail from the application's business logic. For more I have updated the answer. Please take an overview. — SkyWalker, Mar 20 '16 at 14:46
Where does CompactTokenizer come from? I was expecting it to use StaX as in your earlier fragment. — Johannes Ernst, Mar 20 '16 at 23:43
@JohannesErnst I have added a CompactTokenizer.java as a sample. Please check and prepare as your need. — SkyWalker, Mar 21 '16 at 02:20
I appreciate all your work, but you lost me. I cannot see how I could construct the forest I'm looking for from the many trees you provided. — Johannes Ernst, Mar 24 '16 at 03:46
@JohannesErnst Sorry for being so pedantic. You can go through StAX for your issue. I hope this solves your problem. — SkyWalker, Mar 24 '16 at 04:31
I'd [appreciate](http://meta.stackexchange.com/questions/160077/users-are-calling-me-a-plagiarist-what-do-i-do) it if you cited the source where you got the answer for issue 1 from. Passing it off as if it were your own words is very rude. — BalusC, Mar 25 '16 at 14:19
@BalusC First, I want to apologize. You are one of the senior users I follow. I am still very much a learner, and for learning purposes I have collected some answers. I added related links and gave some credit (there are 10 links). In future, I will be careful to name the specific person or link to their post. — SkyWalker, Mar 25 '16 at 17:15
The exact sentence is copypasted from [this post](http://stackoverflow.com/questions/13012327/error-parsing-page-xhtml-error-tracedline-42-the-entity-nbsp-was-referenc/13012488#13012488) which is not mentioned anywhere. And citations better go in citation blocks. — BalusC, Mar 27 '16 at 15:26
@BalusC Thanks for your complement. I have learnt a lot from you. I am grateful to you for your advice, comments and making me more proactive. — SkyWalker, Mar 27 '16 at 16:19

score 3 · Answer 3 · answered Mar 23 '16 at 17:18

Another approach, since you're not using a rigid OXM approach anyway: you might want to try a less rigid parser such as JSoup. This avoids the immediate problems with invalid XML schemas etc., but it just pushes the problem down into your own code.

Richard

score 1 · Answer 4 · answered Mar 18 '16 at 19:08

Just to throw in a different approach to a solution:

You might wrap your input stream in a stream implementation that replaces the entities with something legal.

While this is certainly a hack, it should be a quick and easy solution (or, better said, a workaround).
It is not as elegant and clean as a solution inside the XML framework, though.
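
A minimal sketch of that workaround follows. It assumes the files use an ASCII-compatible encoding (UTF-8, ISO-8859-x, windows-125x, but not UTF-16), so the entity names can be matched at the byte level without knowing the actual charset. The class name EntityPatchingParser and its tiny entity table are made up for illustration, and the code needs Java 9+ (Map.of, Matcher.appendReplacement with a StringBuilder).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of any library.
class EntityPatchingParser {

    // Deliberately tiny, made-up table: entity name -> Unicode code point.
    private static final Map<String, Integer> HTML_ENTITIES =
            Map.of("nbsp", 160, "mdash", 8212);

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known HTML entities as numeric character references such as &#160;. */
    static byte[] patch(byte[] raw) {
        // ISO-8859-1 maps every byte to exactly one char and back, so all bytes
        // outside the ASCII-only replacements survive the round trip unchanged.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher matcher = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            Integer codePoint = HTML_ENTITIES.get(matcher.group(1));
            // Entities missing from the table (including &amp;, &lt;, ...) stay as they are.
            String replacement = codePoint == null ? matcher.group() : "&#" + codePoint + ";";
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Patches the raw bytes, then lets the JDK parser detect the encoding as usual. */
    static Document parse(byte[] raw) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(patch(raw)));
    }
}
```

With java.nio.file.Files.readAllBytes supplying the input, the result is an ordinary org.w3c.dom.Document. Note that this simple pattern also rewrites matches inside CDATA sections and comments, which may or may not be acceptable.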

rpy

Indeed a hack :-) How would I deal with character sets? To look for &...;, for example, I would have to know the charset, but the XML file only specifies it in the first line. – Johannes Ernst Mar 18 '16 at 23:40
You have to know in advance. Of course you could push the hack a bit further and read the XML header to determine the charset, then treat the input accordingly. A solution intrinsic to the XML stack is still much preferable. – rpy Mar 19 '16 at 21:26

score 1 · Answer 5 · edited May 12 '17 at 17:23

Yesterday I did something similar: I needed to add a value from an unzipped XML stream to a database.

// Not every import below may be strictly necessary :)
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// I have not re-checked this code because I am at work; it should work,
// but you may need to make small changes.
public class ParsingHexToChar {

    public static void main(String[] args) throws Exception {
        InputSource is = new InputSource(new FileInputStream("test.xml"));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(is);
        XPathFactory xpf = XPathFactory.newInstance();
        XPath xpath = xpf.newXPath();
        String words = xpath.evaluate("/foo/bar", doc.getDocumentElement());
        String decoded = parseToChar(words);
    }

    // Decodes HTML 4 entities in a string; the library I use is commons-lang3.jar.
    public static String parseToChar(String words) {
        return org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
    }
}

score 1 · Answer 6 · answered Nov 26 '18 at 13:50

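Try this, using the Apache Commons libraries (commons-lang3 and commons-io):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.text.translate.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();

// Read the whole file as text; UTF-8 is an assumption, use the file's real encoding.
InputStream in = new FileInputStream(xmlfile);
String xml = IOUtils.toString(in, StandardCharsets.UTF_8);

// Translate only the HTML-specific entities; XML's own (&amp;, &lt;, ...) are left alone.
CharSequenceTranslator translator = new AggregateTranslator(
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE()));
xml = translator.translate(xml);

InputSource is = new InputSource(new StringReader(xml));
Document doc = parser.parse(is);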
