1

I have an XML file that I read with a SAX parser, write to CSV, and then import into a database.

In my XML file there is an author element: <author>G&uuml;nther Heinemann</author>. As you can see, the "ü" is written in the XML as &uuml;.

I need to store the author in the database, but I can't store that character as "ü". I need to store it as &uuml;.

However, when I use the SAX parser to read the XML, it keeps returning "ü" instead of &uuml;. How can I make Java keep &uuml; rather than "ü"?

Thank you

bdoughan
user2741620
  • possible duplicate of [SAX parser: Ignoring special characters](http://stackoverflow.com/questions/5475202/sax-parser-ignoring-special-characters) – Ludovic Kuty Oct 25 '13 at 08:00

3 Answers

0

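It is probably slow too, because a large HTML DTD with its includes is likely being read. However, you need that DTD: a lone ampersand (&) is not allowed in XML, and every named entity other than the five predefined ones must be declared. The HTML DTD defines a few hundred HTML entity names, like &eacute; (é).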

The DTD could be served from an XML catalog, which maps that HTML URL to a local, offline copy. You could then change the entity definitions in that copy. But that is too much work.

Alternatively, you could install your own EntityResolver in the parser, and so on. That needs some research, but is relatively easy.

The easiest way would be to wrap the input in your own InputStream/Reader, say a BufferedReader, that replaces & with &amp; before the parser sees the text.

The XML then contains &amp;uuml; instead of &uuml;, so the parser reports the literal text &uuml;.

line = line.replace("&", "&amp;");
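// Undo the extra escaping for the five predefined XML entities.
// "amp" must come last, otherwise an input like &amp;lt; would turn into &lt;.
String[] xmlTags = { "lt", "gt", "quot", "apos", "amp" };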
for (String xmlTag : xmlTags) {
    line = line.replace("&amp;" + xmlTag + ";", "&" + xmlTag + ";");
}
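
To see the whole round trip, here is a minimal, self-contained sketch of the same idea. The class and method names (AmpersandEscapeDemo, protectEntities, parseText) are made up for illustration, and the escaping is done with a single regular expression rather than the replace loop above. It assumes the whole document fits in a String.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandEscapeDemo {

    // Escape every '&' that does not start one of the five predefined XML
    // entities, so the parser sees "&amp;uuml;" where the file has "&uuml;".
    static String protectEntities(String xml) {
        return xml.replaceAll("&(?!(?:amp|lt|gt|quot|apos);)", "&amp;");
    }

    // Parse the protected document and return all character data it contains.
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(protectEntities(xml))), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Prints: G&uuml;nther Heinemann
        System.out.println(parseText("<author>G&uuml;nther Heinemann</author>"));
    }
}
```

Note that this sketch also escapes numeric character references such as &#252; and any ampersands inside CDATA sections, so it only fits input that does not rely on those.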
Joop Eggen
0

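Use the escapeHtml() and unescapeHtml() utility methods of Apache Commons Lang's StringEscapeUtils:

// Entity references to plain characters: "G&uuml;nther" becomes "Günther"
String plain = StringEscapeUtils.unescapeHtml(htmlSafe);

// Plain characters to entity references: "Günther" becomes "G&uuml;nther"
String htmlSafe = StringEscapeUtils.escapeHtml(plain);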
Bohemian
  • Doesn't `unescapeXml()` do the opposite of what the OP asks for? I would try [escapeHtml()](http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml%28java.lang.String%29) instead. – mzjn Oct 12 '13 at 12:57
  • I haven't actually tried any of this myself, but according to the documentation, `escapeXml()` and `unescapeXml()` only support the five built-in XML entities. In order to turn `ü` into `&uuml;`, it seems that you'd have to use `escapeHtml()`. – mzjn Oct 12 '13 at 13:10
  • One more thing: in the latest version (3.1) of Apache Commons Lang, I noticed some changes in the API. For example, `escapeHtml()` has become [`escapeHtml4()`](http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html#escapeHtml4(java.lang.String)). – mzjn Oct 13 '13 at 12:57
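
To make the direction of the conversion concrete without pulling in the library, here is a hand-rolled stand-in for what escapeHtml()/escapeHtml4() would do to the OP's data. It is only a sketch: EntityEscapeDemo and escapeUmlauts are hypothetical names, and the table covers just the German umlauts and ß, whereas the library methods handle the full HTML entity set.

```java
import java.util.Map;

class EntityEscapeDemo {

    // Deliberately tiny, illustrative table (an assumption for this sketch).
    // Characters are written as Unicode escapes to avoid source-encoding issues.
    private static final Map<Character, String> NAMES = Map.of(
            '\u00e4', "auml", '\u00f6', "ouml", '\u00fc', "uuml",
            '\u00c4', "Auml", '\u00d6', "Ouml", '\u00dc', "Uuml",
            '\u00df', "szlig");

    // Replace each mapped character with its named entity reference.
    static String escapeUmlauts(String plain) {
        StringBuilder out = new StringBuilder(plain.length() + 16);
        for (int i = 0; i < plain.length(); i++) {
            char c = plain.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: G&uuml;nther Heinemann
        System.out.println(escapeUmlauts("G\u00fcnther Heinemann"));
    }
}
```

With this approach the SAX parser can keep delivering "ü" as usual, and the conversion happens once, just before the value is written to the CSV file.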
0

You could use a modified version of the code below, which catches the start and end of entities. It takes a few seconds to execute because the parser has to fetch the declarations of all HTML Latin-1 entities. When you get an entity whose name does not start with %, you could replace the inserted character in your acc buffer with the entity reference. Pay attention to predefined entities like &amp;.

You could also use a SAX filter to do the job automatically. See the answer https://stackoverflow.com/a/5524862/452614. I might update my answer to provide a complete solution.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.ext.DefaultHandler2;

class MyHandler extends DefaultHandler2 {

    private StringBuilder acc;

    public MyHandler() {
        acc = new StringBuilder();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        System.out.printf("startElement. uri:%s, localName:%s, qName:%s\n", uri, localName, qName);
        acc.setLength(0);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.printf("endElement. uri:%s, localName:%s, qName:%s\n", uri, localName, qName);
        System.out.printf("Characters accumulated: %s\n", acc.toString());
        acc.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        acc.append(ch, start, length);
        System.out.printf("characters. [%s]\n", new String(ch, start, length));
    }

    @Override
    public void startEntity(java.lang.String name)
            throws SAXException {
        System.out.printf("startEntity: %s\n", name);
    }

    @Override
    public void endEntity(java.lang.String name)
            throws SAXException {
        System.out.printf("endEntity: %s\n", name);
    }
}

public class SAXTest1 {

    public static void main(String args[]) throws SAXException, ParserConfigurationException, UnsupportedEncodingException {
        String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE author [\n<!ELEMENT author (#PCDATA)>\n<!ENTITY % HTMLlat1 PUBLIC \"-//W3C//ENTITIES Latin 1 for XHTML//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">\n%HTMLlat1;\n]>\n<author>G&uuml;nther Heinemann</author>";
        System.out.println(s);
        InputStream stream = new ByteArrayInputStream(s.getBytes("UTF-8"));

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader xmlReader = factory.newSAXParser().getXMLReader();

        DefaultHandler2 handler = new MyHandler();
        xmlReader.setContentHandler(handler);
        xmlReader.setProperty(
                "http://xml.org/sax/properties/lexical-handler",
                handler);

        try {
            xmlReader.parse(new InputSource(stream));
        } catch (IOException e) {
            System.err.println("I/O error: " + e.getMessage());
        } catch (SAXException e) {
            System.err.println("Parsing error: " + e.getMessage());
        }
    }
}

Program execution:

$ java SAXTest1
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE author [
<!ELEMENT author (#PCDATA)>
<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>
<author>G&uuml;nther Heinemann</author>
startEntity: %HTMLlat1
endEntity: %HTMLlat1
startElement. uri:, localName:, qName:author
characters. [G]
startEntity: uuml
endEntity: uuml
characters. [ünther Heinemann]
endElement. uri:, localName:, qName:author
Characters accumulated: Günther Heinemann
Community
Ludovic Kuty