I have some .xml files that are encoded in UTF-8. Whenever I try to parse them on my tablet (Lenovo IdeaPad, Android 3.1), I get the same error:

org.xml.sax.SAXParseException: Unexpected token (position: TEXT @1:2 in java.io.StringReader@40bdaef8).

These are the lines that throw the exception:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xmlData));
Document doc = db.parse(inputSource); // This line throws exception

Here is my input:

public String getFromFile(ASerializer aserializer) {
    String filename = aserializer.toLocalResource();
    String data = new String();
    try {
        InputStream stream = _context.getResources().getAssets().open(filename);
        BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
        StringBuilder str = new StringBuilder();
        String line = null;
        while ((line = reader.readLine()) != null) {
            str.append(line);
        }
        stream.close();
        data = str.toString();
    } catch (Exception e) {
    }
    return data;
}

XML File:

<Results>
    <Result title="08/07/2011">
        <Field title="Company one" value="030589674"/>
        <Field title="Company two" value="081357852"/>
        <Field title="Company three" value="093587125"/>
        <Field title="Company four" value="095608977"/>
    </Result>
    <Result title="11/07/2011">
        <Field title="Company one" value="030589674"/>
        <Field title="Company two" value="081357852"/>
    </Result>
</Results>

I don't want to convert them to ANSI, so is there any way to make db.parse() work?

iCantSeeSharp

3 Answers


At this line:

BufferedReader reader = new BufferedReader(new InputStreamReader(stream));

You're reading from the stream using the platform default encoding. That's almost certainly not what you want. You'd need to check the XML for the actual encoding, and the correct way to do that is somewhat complicated.

Luckily, every sane XML parser (including the Java/Android one) can do that on its own. To make the XML parser do that, simply pass in the stream itself instead of trying to read it manually.

InputSource inputSource = new InputSource(stream);
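
As a rough illustration of that advice, here is a minimal sketch that hands the raw stream to the parser and closes it only after parsing has finished. The class name `StreamParseSketch`, the method `parse`, and the in-memory sample document are made up for this example; on Android the stream would come from the asset manager instead.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code.
public class StreamParseSketch {

    // Parses XML straight from bytes so the parser can detect the encoding itself.
    public static Document parse(InputStream stream) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        try {
            // InputSource has a constructor that takes an InputStream; no Reader, no String.
            return db.parse(new InputSource(stream));
        } finally {
            stream.close(); // close only after parsing has completed
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the asset file: a UTF-8 document with a Greek attribute value.
        String xml = "<Results><Result title=\"08/07/2011\">"
                + "<Field title=\"\u0395\u03c4\u03b1\u03b9\u03c1\u03b5\u03af\u03b1\" value=\"030589674\"/>"
                + "</Result></Results>";
        InputStream stream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
        Document doc = parse(stream);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

In the question's code, the equivalent would presumably be to return the `InputStream` from `getFromFile` (or parse inside it) rather than building a `String`.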
Joachim Sauer
  • I have the file stream now. So, do I have to create a new Utf-8 String like data = new String(inputSource, "UTF-8"); ? – iCantSeeSharp Oct 25 '11 at 08:02
  • @coder: **No**! you **don't need a `String`**. Simply create an `InputSource` from the `stream` and pass that to the parser. – Joachim Sauer Oct 25 '11 at 08:05
  • I'm adding a new method to my parsers with InputSource as argument. – iCantSeeSharp Oct 25 '11 at 08:13
  • I get a NullPointer Exception when I try to read the inputSource: Does this look ok? Reader reader = new InputStreamReader(stream,"UTF-8"); inputSource = new InputSource(reader); stream.close(); – iCantSeeSharp Oct 25 '11 at 08:32
  • 1. Don't close your stream until the parsing has completed. 2. **Don't create a `Reader`!** `InputSource` has [a constructor that takes an `InputStream`](http://developer.android.com/reference/org/xml/sax/InputSource.html#InputSource(java.io.InputStream))! – Joachim Sauer Oct 25 '11 at 08:35
  • So, I'd rather return the stream and pass it as argument, parse it and then close the stream. Does this scenario sound better? – iCantSeeSharp Oct 25 '11 at 08:38
  • Yes, that's **exactly** what I was telling you to do in my post 1 hour ago. – Joachim Sauer Oct 25 '11 at 08:39
  • I see, thanks. Well, my Managers were of general purpose. I load xmldata from http source as well, so I had to create a new method to do this and I was just doing so many changes I forgot the OP of yours. Thank you very much, in a few minutes I suppose it's going to be working. – iCantSeeSharp Oct 25 '11 at 08:42
  • @coder: no matter **where** you load XML from: trying to convert it to a `String` manually before parsing is a **bad idea**. If your input is (at some point) an `InputStream` then you should **always** pass *that `InputStream`* to the parser. – Joachim Sauer Oct 25 '11 at 08:45
  • I see, then I can fix the http getter to send the stream as well. I tested it and it works; the only thing left is to make the Greek characters appear correctly. So, I guess this is where the setEncode() fits somewhere. – iCantSeeSharp Oct 25 '11 at 08:57
  • @coder: if your XML is correct, then you **never** need to specify the encoding when parsing it: the XML parser must be able to figure it out. It *might* be necessary to specify some encoding later on, but that's unrelated to the XML parsing, then. – Joachim Sauer Oct 25 '11 at 09:00
  • I just figured that. I have the same results as before. I'll need to remove the .setEncode() before parsing and then find a way to fix the files or do something about it, after the fetching of the contents. – iCantSeeSharp Oct 25 '11 at 09:02

Your Java string is UTF-16 encoded by default. If you can't use an InputStream as @Joachim Sauer suggested, then try this:

// Encode explicitly as UTF-8 rather than relying on the platform default charset.
Document doc = db.parse(new ByteArrayInputStream(xmlData.getBytes("UTF-8")));
pleerock

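You are quite likely using an XML file that starts with a BOM (byte order mark). If the file is read into a String first, the BOM can end up as a stray character in front of the root element, which would be consistent with an error reported at position 1:2.

Either use an API that detects the encoding from the BOM or, alternatively, preprocess the file so that no BOM is present.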
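
For the preprocessing route, a minimal sketch might look like the following. It only handles the UTF-8 BOM (bytes EF BB BF), and the class and method names (`BomSkipSketch`, `skipUtf8Bom`, `stripBom`) are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper class; handles only the UTF-8 BOM, not UTF-16/UTF-32 BOMs.
public class BomSkipSketch {

    // Returns a stream positioned just after a leading UTF-8 BOM, if one is present.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break; // stream is shorter than three bytes
            }
            read += n;
        }
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read); // not a BOM: put the bytes back
        }
        return pushback;
    }

    // Variant for text that has already been decoded into a String.
    public static String stripBom(String text) {
        if (text != null && text.startsWith("\uFEFF")) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream cleaned = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) cleaned.read());
    }
}
```

The stream variant fits the case where the bytes go straight to the parser; the String variant is only relevant if the data has to pass through a `String` for some other reason.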

sehe