3

I'm writing a client which needs to read multiple consecutive small XML documents over a socket. I can assume that the encoding is always UTF-8 and that there is optionally delimiting whitespace between documents. The documents should ultimately go into DOM objects. What is the best way to accomplish this?

The essence of the problem is that the parsers expect a single document in the stream and consider the rest of the content junk. I thought that I could artificially end the document by tracking the element depth and creating a new reader on the existing input stream, e.g. something like:

// Broken 
public void parseInputStream(InputStream inputStream) throws Exception
{
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLOutputFactory xof = XMLOutputFactory.newInstance();
    XMLEventFactory eventFactory = XMLEventFactory.newInstance();        
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
    Document doc = documentBuilder.newDocument();
    XMLEventWriter domWriter = xof.createXMLEventWriter(new DOMResult(doc));
    XMLStreamReader xmlStreamReader = factory.createXMLStreamReader(inputStream);
    XMLEventReader reader = factory.createXMLEventReader(xmlStreamReader);
    int depth = 0;

    while (reader.hasNext()) {
        XMLEvent evt = reader.nextEvent();
        domWriter.add(evt);

        switch (evt.getEventType()) {
        case XMLEvent.START_ELEMENT:
            depth++;
            break;

        case XMLEvent.END_ELEMENT:
            depth--;

            if (depth == 0) 
            {                       
                domWriter.add(eventFactory.createEndDocument());
                System.out.println(doc);
                reader.close();
                xmlStreamReader.close();

                xmlStreamReader = factory.createXMLStreamReader(inputStream);
                reader = factory.createXMLEventReader(xmlStreamReader);

                doc = documentBuilder.newDocument();
                domWriter = xof.createXMLEventWriter(new DOMResult(doc));    
                domWriter.add(eventFactory.createStartDocument());
            }
            break;                    
        }
    }
}

However, running this on input such as <a></a><b></b><c></c> prints the first document and then throws an XMLStreamException. What's the right way to do this?

Clarification: Unfortunately the protocol is fixed by the server and cannot be changed, so prepending a length or wrapping the contents would not work.

eaubin
  • Can't you just catch the XMLStreamException and use it as a trigger to parse the input stream again for the next document? – Andreas Dolk May 28 '09 at 14:15

9 Answers

3
  • Length-prefix each document (in bytes).
  • Read the length of the first document from the socket
  • Read that much data from the socket, dumping it into a ByteArrayOutputStream
  • Create a ByteArrayInputStream from the results
  • Parse that ByteArrayInputStream to get the first document
  • Repeat for the second document etc
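The steps above could be sketched roughly as follows. The framing is an assumption (a 4-byte big-endian length before each document, since the answer does not specify one), the class name LengthPrefixedXmlReader is made up for illustration, and readFully into a byte array stands in for the ByteArrayOutputStream step:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: reads length-prefixed XML documents into DOM objects.
final class LengthPrefixedXmlReader {

    /** Reads documents until the stream ends. */
    static List<Document> readAll(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        DataInputStream data = new DataInputStream(in);
        List<Document> docs = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = data.readInt(); // assumed framing: 4-byte big-endian length
            } catch (EOFException endOfStream) {
                return docs; // no further length prefix, so no further document
            }
            if (length < 0) {
                throw new IOException("Negative document length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // blocks until the whole document has arrived
            docs.add(builder.parse(new ByteArrayInputStream(payload)));
        }
    }
}
```

Because each parser only ever sees one document's bytes, it cannot read ahead into the next document.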
Jon Skeet
1

Wrap the stream in an artificial root element; just change the FileInputStream below to whatever stream you need:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;

import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class LogParser {

    private XMLInputFactory inputFactory = null;
    private XMLStreamReader xmlReader = null;
    InputStream is;
    private int depth;
    private QName rootElement;

    private static class XMLStream extends InputStream
    {
        InputStream delegate;
        StringReader startroot = new StringReader("<root>");
        StringReader endroot = new StringReader("</root>");

        XMLStream(InputStream delegate)
        {
            this.delegate = delegate;
        }

        @Override
        public int read() throws IOException {
            // Serve "<root>" first, then the wrapped stream, then "</root>".
            int c = startroot.read();
            if (c == -1) {
                c = delegate.read();
            }
            if (c == -1) {
                c = endroot.read();
            }
            return c;
        }

    }

    public LogParser() {
        inputFactory = XMLInputFactory.newInstance();
    }

    public void read() throws Exception {
        is = new XMLStream(new FileInputStream(new File(
            "./myfile.log")));
        xmlReader = inputFactory.createXMLStreamReader(is);

        while (xmlReader.hasNext()) {
            printEvent(xmlReader);
            xmlReader.next();
        }
        xmlReader.close();

    }

    public void printEvent(XMLStreamReader xmlr) throws Exception {
        switch (xmlr.getEventType()) {
        case XMLStreamConstants.END_DOCUMENT:
            System.out.println("finished");
            break;
        case XMLStreamConstants.START_ELEMENT:
            System.out.print("<");
            printName(xmlr);
            printNamespaces(xmlr);
            printAttributes(xmlr);
            System.out.print(">");
            if(rootElement==null && depth==1)
            {
                rootElement = xmlr.getName();
            }
            depth++;
            break;
        case XMLStreamConstants.END_ELEMENT:
            System.out.print("</");
            printName(xmlr);
            System.out.print(">");
            depth--;
            if(depth==1 && rootElement.equals(xmlr.getName()))
            {
                rootElement=null;
                System.out.println("finished element");
            }
            break;
        case XMLStreamConstants.SPACE:
        case XMLStreamConstants.CHARACTERS:
            int start = xmlr.getTextStart();
            int length = xmlr.getTextLength();
            System.out
                    .print(new String(xmlr.getTextCharacters(), start, length));
            break;

        case XMLStreamConstants.PROCESSING_INSTRUCTION:
            System.out.print("<?");
            if (xmlr.hasText())
                System.out.print(xmlr.getText());
            System.out.print("?>");
            break;

        case XMLStreamConstants.CDATA:
            System.out.print("<![CDATA[");
            start = xmlr.getTextStart();
            length = xmlr.getTextLength();
            System.out
                    .print(new String(xmlr.getTextCharacters(), start, length));
            System.out.print("]]>");
            break;

        case XMLStreamConstants.COMMENT:
            System.out.print("<!--");
            if (xmlr.hasText())
                System.out.print(xmlr.getText());
            System.out.print("-->");
            break;

        case XMLStreamConstants.ENTITY_REFERENCE:
            System.out.print(xmlr.getLocalName() + "=");
            if (xmlr.hasText())
                System.out.print("[" + xmlr.getText() + "]");
            break;

        case XMLStreamConstants.START_DOCUMENT:
            System.out.print("<?xml");
            System.out.print(" version='" + xmlr.getVersion() + "'");
            System.out.print(" encoding='" + xmlr.getCharacterEncodingScheme()
                    + "'");
            if (xmlr.isStandalone())
                System.out.print(" standalone='yes'");
            else
                System.out.print(" standalone='no'");
            System.out.print("?>");
            break;

        }
    }

    public static void main(String[] args) {
        try {
            new LogParser().read();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void printName(XMLStreamReader xmlr) {
        if (xmlr.hasName()) {
            System.out.print(getName(xmlr));
        }
    }

    private static String getName(XMLStreamReader xmlr) {
        if (xmlr.hasName()) {
            String prefix = xmlr.getPrefix();
            String uri = xmlr.getNamespaceURI();
            String localName = xmlr.getLocalName();
            return getName(prefix, uri, localName);
        }
        return null;
    }

    private static String getName(String prefix, String uri, String localName) {
        String name = "";
        if (uri != null && !("".equals(uri)))
            name += "['" + uri + "']:";
        if (prefix != null)
            name += prefix + ":";
        if (localName != null)
            name += localName;
        return name;
    }   

    private static void printAttributes(XMLStreamReader xmlr) {
        for (int i = 0; i < xmlr.getAttributeCount(); i++) {
            printAttribute(xmlr, i);
        }
    }

    private static void printAttribute(XMLStreamReader xmlr, int index) {
        String prefix = xmlr.getAttributePrefix(index);
        String namespace = xmlr.getAttributeNamespace(index);
        String localName = xmlr.getAttributeLocalName(index);
        String value = xmlr.getAttributeValue(index);
        System.out.print(" ");
        System.out.print(getName(prefix, namespace, localName));
        System.out.print("='" + value + "'");
    }

    private static void printNamespaces(XMLStreamReader xmlr) {
        for (int i = 0; i < xmlr.getNamespaceCount(); i++) {
            printNamespace(xmlr, i);
        }
    }

    private static void printNamespace(XMLStreamReader xmlr, int index) {
        String prefix = xmlr.getNamespacePrefix(index);
        String uri = xmlr.getNamespaceURI(index);
        System.out.print(" ");
        if (prefix == null)
            System.out.print("xmlns='" + uri + "'");
        else
            System.out.print("xmlns:" + prefix + "='" + uri + "'");
    }

}
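Since the question asks for DOM objects rather than printed events, here is a hedged sketch of the same wrap-in-a-root idea that copies each top-level child into its own Document. It is not part of the answer above: the class name WrappedRootSplitter is made up, SequenceInputStream replaces the custom wrapper stream, and an identity Transformer with a StAXSource does the copying. It assumes the documents carry no <?xml ...?> declaration, which would be illegal in the middle of the wrapped stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

// Hypothetical helper: wraps the input in <root>...</root> and splits it into DOM documents.
final class WrappedRootSplitter {

    static List<Document> readAll(InputStream in) throws Exception {
        InputStream wrapped = new SequenceInputStream(Collections.enumeration(
                Arrays.<InputStream>asList(
                        new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                        in,
                        new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)))));

        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(wrapped);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        List<Document> docs = new ArrayList<>();

        xsr.nextTag();          // move from START_DOCUMENT to the artificial <root>
        int event = xsr.next(); // first event inside <root>
        while (event != XMLStreamConstants.END_DOCUMENT) {
            if (event == XMLStreamConstants.START_ELEMENT) {
                // Copy the element subtree starting here into a fresh Document.
                DOMResult result = new DOMResult();
                identity.transform(new StAXSource(xsr), result);
                docs.add((Document) result.getNode());
                // Where the reader is left after transform() may vary by implementation,
                // so re-examine the current event instead of blindly advancing.
                event = xsr.getEventType();
            } else {
                event = xsr.next(); // skip whitespace, comments, and end tags
            }
        }
        xsr.close();
        return docs;
    }
}
```

On a socket, each document becomes available as soon as its closing tag has been parsed; the loop only ends when the peer closes the connection.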
1

IIRC, XML documents can have comments and processing-instructions at the end, so there's no real way of telling exactly when you have come to the end of the file.

A couple of ways of handling the situation have already been mentioned. Another alternative is to insert a byte that is illegal in XML, such as NUL (zero), into the stream between documents. This has the advantage that you don't need to alter the documents and you never need to buffer an entire file.
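A minimal sketch of the delimiter idea, assuming the sender writes a single zero byte after each document (which the fixed protocol in the question does not do). The class name NulDelimitedStream and its nextDocument() method are made up for illustration. The wrapper reports end-of-stream at each NUL, so a standard parser sees exactly one document at a time:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical wrapper: presents each NUL-terminated document as its own stream.
final class NulDelimitedStream extends InputStream {
    private final InputStream source;
    private int pending = -1;           // byte read ahead by nextDocument(), or -1
    private boolean atDelimiter = true; // true until nextDocument() opens a document
    private boolean exhausted;          // underlying stream has ended

    NulDelimitedStream(InputStream source) {
        this.source = source;
    }

    @Override
    public int read() throws IOException {
        if (atDelimiter || exhausted) {
            return -1;
        }
        int b;
        if (pending != -1) {
            b = pending;
            pending = -1;
        } else {
            b = source.read();
        }
        if (b == -1) {
            exhausted = true;
        } else if (b == 0) {
            atDelimiter = true; // NUL ends the current document
            return -1;
        }
        return b;
    }

    /** Skips whitespace between documents; returns false once the source has ended. */
    boolean nextDocument() throws IOException {
        if (exhausted) {
            return false;
        }
        int b;
        do {
            b = source.read();
        } while (b == ' ' || b == '\t' || b == '\r' || b == '\n');
        if (b == -1) {
            exhausted = true;
            return false;
        }
        pending = b;
        atDelimiter = false;
        return true;
    }

    // close() is deliberately not overridden: parsers may close their input,
    // and the underlying socket stream has to stay open for the next document.

    static List<Document> readAll(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        NulDelimitedStream frames = new NulDelimitedStream(in);
        List<Document> docs = new ArrayList<>();
        while (frames.nextDocument()) {
            docs.add(builder.parse(frames));
        }
        return docs;
    }
}
```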

Tom Hawtin - tackline
0

I was faced with a similar problem. A web service I'm consuming will (in some cases) return multiple XML documents in response to a single HTTP GET request. I could read the entire response into a String and split it, but instead I implemented a splitting input stream based on user467257's post in this thread. Here is the code:

public class AnotherSplittingInputStream extends InputStream {
    private final InputStream realStream;
    private final byte[] closeTag;

    private int matchCount;
    private boolean realStreamFinished;
    private boolean reachedCloseTag;

    public AnotherSplittingInputStream(InputStream realStream, String closeTag) {
        this.realStream = realStream;
        // Use an explicit charset instead of the platform default.
        this.closeTag = closeTag.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    @Override
    public int read() throws IOException {
        if (reachedCloseTag) {
            return -1;
        }

        if (matchCount == closeTag.length) {
            matchCount = 0;
            reachedCloseTag = true;
            return -1;
        }

        int ch = realStream.read();
        if (ch == -1) {
            realStreamFinished = true;
        }
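        else if (ch == closeTag[matchCount]) {
            matchCount++;
        } else {
            // Restart the match; the mismatched byte may itself begin the close tag.
            matchCount = (ch == closeTag[0]) ? 1 : 0;
        }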
        return ch;
    }

    public boolean hasMoreData() {
        if (realStreamFinished == true) {
            return false;
        } else {
            reachedCloseTag = false;
            return true;
        }
    }
}

And to use it:

String xml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
        "<root>first root</root>" +
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
        "<root>second root</root>";
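ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes(java.nio.charset.StandardCharsets.UTF_8));
// The class defined above is AnotherSplittingInputStream.
AnotherSplittingInputStream splitter = new AnotherSplittingInputStream(is, "</root>");
BufferedReader reader = new BufferedReader(new InputStreamReader(splitter, java.nio.charset.StandardCharsets.UTF_8));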

while (splitter.hasMoreData()) {
    System.out.println("Starting next stream");
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println("line ["+line+"]");
    }
}
colini
  • This only works for XML where the root tag is unique and there is no whitespace in between. – eckes Jul 23 '12 at 23:25
  • That's true and that was the problem I was trying to solve. It would be straightforward to consume "whitespace/junk" data between the xml docs, esp if each begins with the header. If the root elements are not unique then it is more complicated but still doable. – colini Jul 25 '12 at 15:57
0

I use a JAXB approach to unmarshal multiple messages from a single stream:

MultiInputStream.java

public class MultiInputStream extends InputStream {
    private final Reader source;
    private final StringReader startRoot = new StringReader("<root>");
    private final StringReader endRoot = new StringReader("</root>");

    public MultiInputStream(Reader source) {
        this.source = source;
    }

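    @Override
    public int read() throws IOException {
        // Serve "<root>" first, then the source, then "</root>".
        // Note: chars are returned as bytes, so this is only safe for ASCII input.
        int ch = startRoot.read();
        if (ch == -1) {
            ch = source.read();
        }
        if (ch == -1) {
            ch = endRoot.read();
        }
        return ch;
    }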
}

MultiEventReader.java

public class MultiEventReader implements XMLEventReader {

    private final XMLEventReader reader;
    private boolean isXMLEvent = false;
    private int level = 0;

    public MultiEventReader(XMLEventReader reader) throws XMLStreamException {
        this.reader = reader;
        startXML();
    }

    private void startXML() throws XMLStreamException {
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                return;
            }
        }
    }

    public boolean hasNextXML() {
        return reader.hasNext();
    }

    public void nextXML() throws XMLStreamException {
        while (reader.hasNext()) {
            XMLEvent event = reader.peek();
            if (event.isStartElement()) {
                isXMLEvent = true;
                return;
            }
            reader.nextEvent();
        }
    }

    @Override
    public XMLEvent nextEvent() throws XMLStreamException {
        XMLEvent event = reader.nextEvent();
        if (event.isStartElement()) {
            level++;
        }
        if (event.isEndElement()) {
            level--;
            if (level == 0) {
                isXMLEvent = false;
            }
        }
        return event;
    }

    @Override
    public boolean hasNext() {
        return isXMLEvent;
    }

    @Override
    public XMLEvent peek() throws XMLStreamException {
        XMLEvent event = reader.peek();
        if (level == 0) {
            while (event != null && !event.isStartElement() && reader.hasNext()) {
                reader.nextEvent();
                event = reader.peek();
            }
        }
        return event;
    }

    // The remaining XMLEventReader methods are not needed for this use case.

    @Override
    public String getElementText() throws XMLStreamException {
        throw new UnsupportedOperationException();
    }

    @Override
    public XMLEvent nextTag() throws XMLStreamException {
        throw new UnsupportedOperationException();
    }

    @Override
    public Object getProperty(String name) throws IllegalArgumentException {
        throw new UnsupportedOperationException();
    }

    @Override
    public void close() throws XMLStreamException {
        throw new UnsupportedOperationException();
    }

    @Override
    public Object next() {
        throw new UnsupportedOperationException();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}

Message.java

@XmlAccessorType(XmlAccessType.FIELD)
@XmlRootElement(name = "Message")
public class Message {

    public Message() {
    }

    @XmlAttribute(name = "ID", required = true)
    protected long id;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    @Override
    public String toString() {
        return "Message{id=" + id + '}';
    }
}

Read multiple messages:

public static void main(String[] args) throws Exception{

    StringReader stringReader = new StringReader(
            "<Message ID=\"123\" />\n" +
            "<Message ID=\"321\" />"
    );

    JAXBContext context = JAXBContext.newInstance(Message.class);
    Unmarshaller unmarshaller = context.createUnmarshaller();

    XMLInputFactory inputFactory = XMLInputFactory.newFactory();
    MultiInputStream multiInputStream = new MultiInputStream(stringReader);
    XMLEventReader xmlEventReader = inputFactory.createXMLEventReader(multiInputStream);
    MultiEventReader multiEventReader = new MultiEventReader(xmlEventReader);

    while (multiEventReader.hasNextXML()) {
        Object message = unmarshaller.unmarshal(multiEventReader);
        System.out.println(message);
        multiEventReader.nextXML();
    }
}

Results:

Message{id=123}
Message{id=321}
IvanNik
0

I found this forum message (which you have probably already seen); its solution wraps the input stream and tests for one of two ASCII characters (see post).

You could try an adaptation of this by first converting to a Reader (for proper character encoding) and then counting elements until you reach the closing element, at which point you signal the end of message (EOM).

deterb
0

Hi, I also had this problem at work (so I won't post the resulting code). The most elegant solution that I could think of, and which works pretty nicely IMO, is as follows.

Create a class, for example DocumentSplittingInputStream, which extends InputStream and takes the underlying input stream in its constructor (or has it set after construction). Add a byte array field closeTag containing the bytes of the closing root tag you are looking for. Add an int field called matchCount (or similar), initialised to zero. Add a boolean field called underlyingInputStreamNotFinished, initialised to true.

On the read() implementation:

  1. Check whether matchCount == closeTag.length; if so, set matchCount to -1 and return -1.
  2. If matchCount == -1, set matchCount = 0, then call read() on the underlying input stream until you get -1 or '<' (the start of the XML declaration of the next document on the stream) and return it. Note that, for all I know, the XML spec allows comments after the document element, but I knew I was not going to get that from the source so did not bother handling it. If you cannot be sure, you'll need to change this "gobble" step slightly.
  3. Otherwise, read an int from the underlying input stream (if it equals closeTag[matchCount] then increment matchCount; if it doesn't, reset matchCount to zero) and return the newly read byte.

Add a method which returns the boolean on whether the underlying stream has closed. All reads on the underlying input stream should go through a separate method where it checks if the value read is -1 and if so, sets the field "underlyingInputStreamNotFinished" to false.

I may have missed some minor points, but I'm sure you get the picture.

Then, in the calling code, you do something like this (if you are using XStream):

DocumentSplittingInputStream dsis = new DocumentSplittingInputStream(underlyingInputStream);
while (dsis.underlyingInputStreamNotFinished()) {
    MyObject mo = xstream.fromXML(dsis);
    mo.doSomething(); // or something.doSomething(mo);
}

David

user467257
0

I had to do something like this, and during my research on how to approach it I found this thread. Even though it is quite old, I just replied (to myself) here, wrapping everything in its own Reader for simpler use.

Filipe Pina
0

A simple solution is to wrap the documents on the sending side in a new root element:

<?xml version="1.0"?>
<documents>
    ... document 1 ...
    ... document 2 ...
</documents>

You must make sure that you don't include the XML header (<?xml ...?>) of the individual documents, though. If all documents use the same encoding, this can be accomplished with a simple filter that ignores the first line of each document if it starts with <?xml.
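A small sketch of such a filter on the sending side, assuming each document is available as a String. The class and method names (DocumentWrapper, stripXmlDeclaration, wrap) are made up for illustration, and the regular expression removes only a leading <?xml ...?> declaration, not other processing instructions such as <?xml-stylesheet ...?>:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sending-side helper: wraps several documents in one root element.
final class DocumentWrapper {

    /** Removes a leading <?xml ...?> declaration, if present. */
    static String stripXmlDeclaration(String document) {
        return document.replaceFirst("^\\s*<\\?xml\\s[^?]*\\?>\\s*", "");
    }

    /** Concatenates the documents under a single <documents> root. */
    static String wrap(List<String> documents) {
        StringBuilder out = new StringBuilder("<?xml version=\"1.0\"?>\n<documents>\n");
        for (String document : documents) {
            out.append(stripXmlDeclaration(document)).append('\n');
        }
        return out.append("</documents>").toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap(Arrays.asList(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<a/>",
                "<b>text</b>")));
    }
}
```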

Aaron Digulla
  • An issue with this could be that he may need to be able to parse each document separately. A workaround could be to take the SAX parser being used to generate the DOM and have it wrap a DOM-generating SAX parser, which could then generate the smaller documents from the larger document, which would never actually end. – deterb Oct 20 '09 at 23:23