
I have a single InputStream or String with two XML documents in it, like so:

<?xml version="1.0" standalone="yes"?> 
<items 
    blahblahblah1 
</items>           
<?xml version="1.0" standalone="yes"?> 
<items 
    blahblahblah2 
</items> 

They have the same format but different data. I would like to parse them, but since the combined text is not valid XML, I first need to find a way to split them up.

The only things that come to mind are String operations:

  1. Split them up into two separate strings at the substring <?xml version="1.0" standalone="yes"?>
  2. Search for and remove the two <?xml version="1.0" standalone="yes"?> lines and surround the remainder with <ROOT> </ROOT> to make a single valid XML document, and figure out how to parse it from there

However, both of these methods seem hacky and inefficient. Is there a better way?

Kalina
  • If you sure that the ` – Praveen Sep 18 '12 at 17:19
  • Having two XMLs in one String is already hacky enough. I'd go for the split. – Xavi López Sep 18 '12 at 17:46
  • It'd be interesting to know why the XML data has to be in this form, and if it could be avoided. However, if it simply has to be this way, it would be useful to know what format and typical size this data arrives in. Also, which parser are you going for, e.g. SAX? I ask because, if it could be quite large, and originates from a File for example, then it might be nicer to solve this problem with a custom buffered reader that wraps around the `InputStream`. If however the sizes are small, then just use `String`s as you suggest, and perhaps wrap the `String`s with `ByteArrayInputStream` and use SAX. – Trevor Sep 18 '12 at 18:05
  • It doesn't have to be a String; I've been messing with it for a while and that's just where its current state ended up... These XMLs are returned as part of a SOAP web service request: ` [the two xmls are here] `. I used a DOM parser to get this data from a BasicHttpResponse, to an InputStream, to a NodeList, to a String. – Kalina Sep 18 '12 at 20:13
  • @Trevor Also, yes, the sizes are quite large. – Kalina Sep 18 '12 at 21:15

2 Answers


It's a bad design, because the string "<?xml" could appear legitimately within a CDATA section or comment. But you're just going to have to take the plunge, split the file wherever you see "<?xml" appear, hope for the best, and blame whoever came up with this idea if it goes wrong. The only alternative is to write your own parser for this variant of XML, which isn't going to be much fun.
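For illustration, here is a minimal sketch of that split, assuming the whole payload fits comfortably in a String. The class name XmlSplitter and the method split are made up for this example. It carries the weakness described above: a "<?xml" inside a CDATA section or comment, or a processing instruction such as <?xml-stylesheet ...?>, would also trigger a split.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XmlSplitter {
    private static final String MARKER = "<?xml";

    /**
     * Cuts the text at every occurrence of "<?xml" and returns one trimmed
     * string per piece. Anything before the first marker is ignored.
     */
    static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        int start = concatenated.indexOf(MARKER);
        while (start >= 0) {
            int next = concatenated.indexOf(MARKER, start + MARKER.length());
            String doc = (next >= 0)
                    ? concatenated.substring(start, next)
                    : concatenated.substring(start);
            docs.add(doc.trim());
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        // Sample input modeled on the question; the element content is invented.
        String input =
                "<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>first</items>\n"
              + "<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>second</items>\n";

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        for (String xml : split(input)) {
            // Each piece is now a standalone document and can be parsed on its own.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            System.out.println(doc.getDocumentElement().getTextContent());
        }
    }
}
```

Each piece keeps its own declaration, so it can be handed to any parser unchanged.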

Michael Kay

Nothing I propose here is tested, but these are the routes I think I would take.

If the response length is expected to be small, I would personally probably just go for placing the concatenated XML response into a String as you suggest, and then either use standard String methods to extract the individual XML documents, or again as you suggested, remove the XML declaration strings and wrap the whole lot with a pair of root elements. It would depend on whether you wanted to feed your XML parser with a single document or multiple. I haven't dealt with BasicHttpResponse in ages, but I think you can get an InputStream of the response entity using mBasicHttpResponse.getEntity().getContent(), and then use one of many ways possible to obtain a String from that InputStream.
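As a rough sketch of the remove-and-wrap variant for small responses: the class WrapInRoot and the method wrap are names invented for this example, the regular expression is my own guess at what a declaration looks like, and the sample data is made up. The root element name ROOT follows the question.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class WrapInRoot {
    // Matches a declaration such as <?xml version="1.0" standalone="yes"?>.
    // The \s after "xml" keeps it from matching e.g. <?xml-stylesheet ...?>.
    // Like any text-level approach, it would also match inside CDATA or comments.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>");

    /** Removes every XML declaration and surrounds the rest with one root element. */
    static String wrap(String concatenated) {
        return "<ROOT>" + XML_DECL.matcher(concatenated).replaceAll("") + "</ROOT>";
    }

    public static void main(String[] args) throws Exception {
        String input =
                "<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>first</items>\n"
              + "<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>second</items>\n";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrap(input))));

        // Each original document is now one <items> child of <ROOT>.
        NodeList items = doc.getDocumentElement().getElementsByTagName("items");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println(items.item(i).getTextContent());
        }
    }
}
```

This gives the parser a single document, at the cost of holding the whole response in memory twice (the String plus the DOM tree).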

If on the other hand I expect to be dealing with pretty lengthy data or if the response entity could contain an indeterminate number of concatenated XML documents, I would then instead think about wrapping the obtained InputStream with a custom InputStream or Reader that performs (a) stripping away of the declarations and (b) insertion of new root elements. There's someone else on SO who asked a very similar question to the problem you're facing here except he didn't have the declarations to deal with. Looking at user656449's answer, we see a suggestion of how to wrap an InputStream with some dummy root elements before passing it to the SAX parser:

(Blatantly copied from referenced SO question / answer):

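import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// "file" is the java.io.File holding the root-less XML; it is defined elsewhere.
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
SAXParser parser = saxFactory.newSAXParser();

// Chain three streams so the parser sees <dummy> ... </dummy> around the data.
parser.parse(
    new SequenceInputStream(
        Collections.enumeration(Arrays.asList(
            new InputStream[] {
                new ByteArrayInputStream("<dummy>".getBytes()),
                new FileInputStream(file), // bogus xml
                new ByteArrayInputStream("</dummy>".getBytes()),
            }))
    ),
    new DefaultHandler()
);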

In this situation, though, you would additionally replace the FileInputStream with some kind of CustomFilterFileInputStream that you write yourself to strip out the declaration lines. Your CustomFilterFileInputStream would wrap the InputStream obtained from your BasicHttpResponse, and the SequenceInputStream would then add the new root tags around it.
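To make that idea concrete, here is one way such a filter might look. This is only a sketch: DeclarationStrippingInputStream, StreamingWrapDemo and wrapWithRoot are names invented here, the sample data is made up, and the code assumes an ASCII-compatible encoding such as UTF-8. It does not understand CDATA sections or comments, so a "<?xml " inside one would be stripped as well.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingWrapDemo {
    /** Strips declarations from the source and surrounds it with <dummy> root tags. */
    static InputStream wrapWithRoot(InputStream concatenated) {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream("<dummy>".getBytes(StandardCharsets.UTF_8)),
                new DeclarationStrippingInputStream(concatenated),
                new ByteArrayInputStream("</dummy>".getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = ("<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>first</items>\n"
                + "<?xml version=\"1.0\" standalone=\"yes\"?>\n<items>second</items>\n")
                .getBytes(StandardCharsets.UTF_8);
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                wrapWithRoot(new ByteArrayInputStream(sample)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("items".equals(qName)) count[0]++;
                    }
                });
        System.out.println("items elements seen: " + count[0]);
    }
}

/** Hypothetical filter that drops every <?xml ...?> declaration it encounters. */
class DeclarationStrippingInputStream extends FilterInputStream {
    private static final int LOOKAHEAD = 5; // "?xml" plus one whitespace byte

    DeclarationStrippingInputStream(InputStream source) {
        super(new PushbackInputStream(source, LOOKAHEAD));
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }

    @Override
    public int read() throws IOException {
        PushbackInputStream pb = (PushbackInputStream) in;
        int b = pb.read();
        if (b != '<') return b;
        // Peek at the bytes after '<' to see whether a declaration starts here.
        byte[] ahead = new byte[LOOKAHEAD];
        int n = 0;
        while (n < LOOKAHEAD) {
            int r = pb.read(ahead, n, LOOKAHEAD - n);
            if (r < 0) break;
            n += r;
        }
        boolean isDeclaration = n == LOOKAHEAD && ahead[0] == '?' && ahead[1] == 'x'
                && ahead[2] == 'm' && ahead[3] == 'l' && isSpace(ahead[4]);
        if (!isDeclaration) {
            pb.unread(ahead, 0, n); // ordinary markup: put the peeked bytes back
            return b;
        }
        // Skip everything up to and including the closing "?>".
        int prev = -1;
        int cur;
        while ((cur = pb.read()) >= 0) {
            if (prev == '?' && cur == '>') break;
            prev = cur;
        }
        return read(); // continue with whatever follows the declaration
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Route bulk reads through read() so they are filtered too.
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int c = read();
            if (c < 0) break;
            buf[off + count++] = (byte) c;
        }
        return count == 0 ? -1 : count;
    }
}
```

The filter reads one byte at a time, so for a real response it would probably be worth wrapping the source in a BufferedInputStream first. Note also that skip() is not overridden here and would bypass the filtering.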

That's the kind of direction I think you would need to go if you really have to accept the XML data in this way, and if you expect to deal with large amounts of it in a single response.

Trevor