XMLParser encoding problems

Question

public XMLParser(InputStream is) {
    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db;
        db = dbf.newDocumentBuilder();
        Document doc = db.parse(is);
        node = doc.getDocumentElement();
    } catch (Exception e) {
        DebugLog.log(e);
    }
}

The InputStream contains content like: "Hey there this is a ü character." In the source, the character 'ü' is written as '&uuml;'.

When I print the node's content with System.out.println(node.getTextContent()), I get "hey there this is a character." The ü is cut off.

score 0 · Answer 1 · edited May 23 '17 at 11:56

Well, is this a valid document? Does it specify an encoding? See http://www.w3schools.com/XML/xml_encoding.asp

These might help:

Howto let the SAX parser determine the encoding from the xml declaration? http://www.coderanch.com/t/127052/XML/XML-parsers-encoding-byte-order

answered Sep 22 '12 at 09:31

baranowb

It's an HTML web page, encoded as ISO-8859-1. – Basic Coder Sep 22 '12 at 09:34
What is the default charset on the device/machine? – baranowb Sep 22 '12 at 09:36
Ah, I just noticed the tag. IIRC, if no encoding is specified, the reader/parser assumes the device encoding (UTF-8 in this case). You need to specify the encoding (http://docs.oracle.com/javase/1.4.2/docs/api/java/io/InputStreamReader.html) or create a custom InputStream that peeks at the encoding. – baranowb Sep 22 '12 at 09:53
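
A minimal sketch of the "specify the encoding" suggestion from the comment above, assuming the bytes really are ISO-8859-1 as stated earlier in the comments. The class name ExplicitEncodingDemo and the helper rootText are hypothetical names used only for illustration. The idea is to decode the bytes yourself with an InputStreamReader and hand the parser an InputSource built from that Reader, so the parser does not have to guess the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExplicitEncodingDemo {

    // Hypothetical helper: decodes the stream with the given charset and
    // returns the text content of the document element.
    static String rootText(InputStream is, Charset charset) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The Reader performs the byte-to-char decoding; the parser only sees characters.
        Reader reader = new InputStreamReader(is, charset);
        Document doc = db.parse(new InputSource(reader));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): a raw 'ü' stored as a single ISO-8859-1 byte.
        byte[] latin1 = "<msg>Hey there this is a \u00fc character.</msg>"
                .getBytes(StandardCharsets.ISO_8859_1);
        String text = rootText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}
```

Note that this only helps when the character arrives as a raw byte in a known charset. It does not change how entity references such as named HTML entities are handled.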

score 0 · Answer 2 · edited May 23 '17 at 11:43

The problem was the difference between XML entities and HTML entities. The web page I request returns data containing HTML entities. I had to convert the HTML entities to XML entities, and then it worked!

Check this answer for some code
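
The linked code is not reproduced here, so the following is only a rough sketch of one way such a conversion could look, not the code the answer refers to. It rewrites a few named HTML entities as numeric character references, which any XML parser understands. The class HtmlEntityDemo, the method htmlToXmlEntities, and the small entity table are all hypothetical; a real page may use many more named entities than the handful listed.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HtmlEntityDemo {

    // Deliberately tiny, illustrative table: HTML entity name -> Unicode code point.
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("uuml", 0xFC);
        HTML_ENTITIES.put("auml", 0xE4);
        HTML_ENTITIES.put("ouml", 0xF6);
        HTML_ENTITIES.put("szlig", 0xDF);
        HTML_ENTITIES.put("nbsp", 0xA0);
    }

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known HTML entities with numeric references (e.g. &uuml; -> &#252;).
    // Unknown names, including XML's own &amp; and &lt;, are left untouched.
    static String htmlToXmlEntities(String input) {
        Matcher m = NAMED_ENTITY.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<msg>Hey there this is a &uuml; character.</msg>";
        String xml = htmlToXmlEntities(raw);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The conversion has to run on the text before it reaches the parser, which means reading the response into a String (or wrapping the stream in a filtering Reader) rather than passing the raw InputStream to db.parse.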

answered Sep 22 '12 at 10:12

Basic Coder
