
I'm using DocumentBuilder and NodeList in Android Studio to parse an XML document. I previously found that the XML was malformed: it had unescaped ampersands in the text. After fixing that and double-checking with the W3 XML validator, I still get an unexpected token error:

e: "org.xml.sax.SAXParseException: Unexpected token (position:TEXT \n \n 601\n ...@5262:1 in java.io.StringReader@cd0db4a)"

However, when I open the xml and look at the line referred to, I don't see anything that would be considered troublesome:

...  ...
5257 <WebSvcLocation>
5258 <Id>1521981</Id>
5259 <Name>Warehouse: Row 3</Name>
5260 <SiteName>Warehouse</SiteName>
5261 </WebSvcLocation>
5262 </ArrayOfWebSvcLocation>

I have also checked the XML for non-printing characters and have not found any. Below is the code I have been using:

public List<Location> SpinnerXML(String xml){
    List<Location> list = new ArrayList<Location>();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder;
    InputSource is;
    String s = xml.replaceAll("[&]"," and ");

    try {
        builder = factory.newDocumentBuilder();
        is = new InputSource(new StringReader(s));
        Document doc = builder.parse(is);
        NodeList lt = doc.getElementsByTagName("WebSvcLocation");
        int id;
        String name,siteName;

        for (int i = 0; i < lt.getLength(); i++) {
            Element el = (Element) lt.item(i);
            id = Integer.parseInt(getValue(el, "Id"));
            name = getValue(el, "Name");
            siteName = getValue(el, "SiteName");

            list.add(new Location(id, name, siteName));
        }

    } catch (ParserConfigurationException e){
    } catch (SAXException e){
        e.printStackTrace();
    } catch (IOException e){
    }

    return list;
}
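
A side note on the `replaceAll("[&]", " and ")` line above: it rewrites every ampersand, including those that begin valid references such as `&amp;`, so already-escaped text turns into ` and amp;`. A minimal sketch of a narrower approach, assuming the only references in the feed are the five predefined XML entities and numeric character references (the class and method names are hypothetical, not part of the original code):

```java
import java.util.regex.Pattern;

class AmpersandEscaper {
    // Matches '&' that does NOT start a predefined entity (&amp; &lt; &gt; &quot; &apos;)
    // or a numeric character reference (&#38; or &#x26;).
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes only bare ampersands and leaves valid references alone.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<Name>Nuts & Bolts</Name>"));
        System.out.println(escapeBareAmpersands("<Name>Nuts &amp; Bolts</Name>"));
    }
}
```

Custom entities declared in a DTD and ampersands inside CDATA sections would also be rewritten by this pattern, so it is only a stopgap; fixing the escaping where the XML is generated is more robust.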

The XML I have been trying to read is hosted [here](http://text-share.com/view/7b1c6918).

Thanks in advance for the help!

charwayne
  • Can you share the data for the whole of this WebSvcLocation element? (presumably an opening tag exists too). Also, what is after line 5262? – khriskooper Aug 11 '17 at 14:22
  • There is only an open line after 5262, otherwise that is the end of the document. I'll edit the question to include the whole WebSvcLocation element – charwayne Aug 11 '17 at 14:27
  • Hmm, I would double check that your XML file is all good. Are you using the correct encoding / line endings / charset, etc? Is it possible to host the XML file somewhere so we can check it over? – khriskooper Aug 11 '17 at 15:09
  • I've hosted the XML [here](http://text-share.com/view/7b1c6918) – charwayne Aug 11 '17 at 18:12

1 Answer


InputSource seems to do some guessing as to the encoding, so here are some things to try.

One reference notes:

Android note: The Android platform default (encoding) is always UTF-8.

Another states:

"Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies."

(1) I would initially recommend:

is.setEncoding("UTF-8");

(2) But it should do no harm to replace this:

Document doc = builder.parse(is);

With this:

Document doc = builder.parse(new ByteArrayInputStream(s.getBytes()));

(3) OR try this:

String s1 = URLDecoder.decode(s, "UTF-8");
Document doc = builder.parse(new ByteArrayInputStream(s1.getBytes()));

NOTE: if you try (2) or (3), comment out:

is = new InputSource(new StringReader(s));

It is no longer used in those variants. (It cannot alter String s itself, since Java strings are immutable.)
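
Variant (2) relies on `getBytes()` with no argument, which uses the platform default charset. A minimal sketch that makes the charset explicit, assuming the document either has no encoding declaration or declares UTF-8 (the class and method names are hypothetical; `StandardCharsets` needs API level 19 or later on Android):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class Utf8XmlParse {
    // Hypothetical helper: encodes the string as UTF-8 explicitly and parses the bytes,
    // so the result does not depend on the platform default charset.
    static Document parseUtf8(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        // Tiny sample shaped like the feed in the question.
        String xml = "<ArrayOfWebSvcLocation><WebSvcLocation><Id>1521981</Id>"
                + "<Name>Warehouse: Row 3</Name><SiteName>Warehouse</SiteName>"
                + "</WebSvcLocation></ArrayOfWebSvcLocation>";
        Document doc = parseUtf8(xml);
        System.out.println(doc.getElementsByTagName("WebSvcLocation").getLength());
    }
}
```

If the XML declaration names a different encoding (for example `utf-16`), parsing UTF-8 bytes can fail in its own way, whereas a `StringReader` sidesteps the declaration because the text is already decoded.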

Jon Goodwin
  • I've tried applying your suggestions but I still get the exact same error. My guess is that InputSource knows the encoding and the issue is actually something deeper. Thanks for your suggestions! – charwayne Aug 11 '17 at 18:05
  • From what I have read, it's best to get a raw input stream (the rawer the better): don't use InputSource, don't use String. The less the data is manipulated, the better. This has solved it for SOME people. But if the data is corrupted, I don't see what I can do to help unless you provide an example BAD data file, and a function path through to your SpinnerXML method. I simply cannot replicate the bug without that. Have you tested, and are you confident in, xml.replaceAll("[&]"," and "); ? That is another possibility. Best to dump that to logcat if it's not too big, else to a file. Oh, you have shared it, cool. – Jon Goodwin Aug 11 '17 at 18:51
  • I'll take a look at your data, don't forget a dump of your modified data may be useful. You are not the only one having problems with the SAXParser. Got the data. – Jon Goodwin Aug 11 '17 at 19:00
  • [here](http://text-share.com/view/35cb1e34) is the dump from the string but I couldn't find the dump for the xml itself – charwayne Aug 11 '17 at 19:38
  • I think I (and others) have enough to be going on with (I should be able to replicate that). Anything downstream of builder.parse() I don't need. The rest compiles. I do have a life outside of this ;O) and have only slept for 3 hours in 48, so bear with me ;O). How the String xml got there, I can replicate/fix. Take care, you will hear from me if I find anything, good luck. To be honest I just passed 3000 points (thanks to you), and I expected fireworks ;O), instead I got "you can do more for us now". Bummer ;O) Not even a badge ;O( – Jon Goodwin Aug 11 '17 at 20:03
  • On a serious note I think we have pretty much eliminated the UTF8 encoding problem, so time to think of some other answer. (1) bad original data (2) data corruption. (3) SAXParser is fubar.... – Jon Goodwin Aug 11 '17 at 20:12
  • I think bad original data is out: the XML being retrieved is generated by another system that has been in use for a long time and has been validated many times over. I guess data corruption could be possible, but the string seems to be correct before being passed to the parser as well. My bet is on SAX being broken – charwayne Aug 11 '17 at 20:19
  • I've tested your code and it works PERFECTLY with YOUR code and YOUR test data. Here's the last record of the processed data: Data: i=1051 id:1521981 name:Warehouse:Row 3 siteName:Warehouse. Where does it fall over (best show a bit more logcat)? And how do you get String xml? (I get it from a raw resource) – Jon Goodwin Aug 16 '17 at 02:06
  • Ok, I finally figured out my issue. Your comment that it worked perfectly got me thinking. I realized I threw in an extra step in my code that messed up the formatting of the XML I was feeding it and as a result, I was getting a SAX issue. Thank you so much for your time and patience! – charwayne Aug 17 '17 at 14:27
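
For this kind of problem, where the XML validates elsewhere but the parser still fails, it can help to print the exact line the parser rejects, taken from the same string that is handed to the parser. A rough sketch, assuming the parser reports position through `SAXParseException.getLineNumber()` and `getColumnNumber()` (a parser may report -1 when the position is unknown); class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class XmlErrorLocator {
    // Returns null when the string parses; otherwise describes the reported
    // position and shows that line with non-printing characters made visible.
    static String describeFailure(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            int line = e.getLineNumber();
            // Split on any common line ending so indices match the parser's line count.
            String[] lines = xml.split("\r\n|\r|\n", -1);
            String text = (line >= 1 && line <= lines.length) ? lines[line - 1] : "";
            return "line " + line + ", column " + e.getColumnNumber() + ": " + escape(text);
        } catch (SAXException e) {
            return "parse failed without position: " + e.getMessage();
        }
    }

    // Replaces control and non-ASCII characters with \\uXXXX-style text so they show up in logs.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Deliberately malformed sample: the closing tag does not match.
        System.out.println(describeFailure("<a>\n<b>x</c>\n</a>"));
    }
}
```

Running this on the string immediately before `builder.parse(...)` would show whether an earlier processing step has already changed the XML, which is what turned out to be the cause here.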