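public XMLParser(InputStream is) {
    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Parse the raw byte stream; no character encoding is specified here.
        Document doc = db.parse(is);
        // 'node' is a field of the enclosing class that holds the root element.
        node = doc.getDocumentElement();
    } catch (Exception e) {
        DebugLog.log(e);
    }
}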

The InputStream contains content like: "Hey there this is a ü character." The character 'ü' is a 'ΓΌ'.

When reading the node's content with System.out.println(node.getTextContent()), I receive "Hey there this is a character." The ü is cut off.

Basic Coder

2 Answers

Well, is this a valid document? Does it have an encoding specified? See http://www.w3schools.com/XML/xml_encoding.asp

These might help:

How to let the SAX parser determine the encoding from the XML declaration? http://www.coderanch.com/t/127052/XML/XML-parsers-encoding-byte-order

baranowb
  • It's an HTML webpage, encoded as ISO-8859-1. – Basic Coder Sep 22 '12 at 09:34
  • What is the default charset on the device/machine? – baranowb Sep 22 '12 at 09:36
  • Ah, just noticed the tag. **IIRC**, if no encoding is specified, the reader/parser assumes the device's default encoding (UTF-8 in this case). You need to specify the encoding (http://docs.oracle.com/javase/1.4.2/docs/api/java/io/InputStreamReader.html) or create a custom InputStream that peeks at the encoding. – baranowb Sep 22 '12 at 09:53
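
The suggestion in the last comment, decoding the stream with an explicit charset instead of letting the parser guess, could look roughly like the sketch below. The class name ExplicitEncodingExample and the helper rootText are made up for illustration, and the ISO-8859-1 charset is taken from the asker's comment; this is not the asker's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExplicitEncodingExample {

    // Hypothetical helper: decodes the bytes with the given charset and
    // returns the text content of the root element.
    static String rootText(InputStream is, Charset charset) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Wrapping the stream in a Reader fixes the charset up front, so the
        // parser receives characters and does no encoding detection itself.
        Reader reader = new InputStreamReader(is, charset);
        Document doc = db.parse(new InputSource(reader));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Sample input encoded as ISO-8859-1, where 'ü' is the single byte 0xFC.
        byte[] latin1 = "<p>Hey there this is a \u00fc character.</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(
                rootText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```

This only helps when the bytes really are in the charset you pass in. If the page contains named HTML entities, the charset is not the issue.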

The problem was the difference between XML entities and HTML entities. The webpage I request returns data containing HTML entities. I had to convert the HTML entities to XML entities, and then it worked!

Check this answer for some code
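
As a rough illustration of that conversion (not the code from the linked answer), the sketch below rewrites named HTML entities such as &uuml; into numeric character references such as &#252;, which an XML parser accepts without a DTD. The class name HtmlEntityConverter, the method toNumericReferences, and the small entity table are assumptions made for this example; a real table would need far more entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlEntityConverter {

    // Deliberately tiny, illustrative table; HTML defines many more named entities.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("uuml", 0xFC);
        NAMED.put("auml", 0xE4);
        NAMED.put("ouml", 0xF6);
        NAMED.put("szlig", 0xDF);
        NAMED.put("nbsp", 0xA0);
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities with numeric references. Entities that are
    // not in the table (including XML's own &amp;, &lt;, ...) are left unchanged.
    static String toNumericReferences(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = NAMED.get(m.group(1));
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Hey there this is a &#252; character.
        System.out.println(toNumericReferences("Hey there this is a &uuml; character."));
    }
}
```

The converted text would be produced before the content is handed to the DocumentBuilder, for example by reading the response into a String first.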

Basic Coder