I have a file that has several XML documents like below in sequence.

<?xml version="1.0"?><Node>...<Node>...</Node>...</Node><?xml version...

which repeats several times.

I'm using Java, with a FileChannel open on the file and a byte buffer to read into. Is there a built-in, simpler, or already-solved way to parse XML incrementally from bytes in Java? For example, something like this:

FooParser parser = new FooParser();

while (...)
{
    buffer.flip();
    parser.parse(buffer);
    buffer.compact();
    if (parser.done())
    {
        xmlDocs.add(parser.xml());
        parser.reset();
    }
    file.read(buffer);
    ...
}
foobarometer

3 Answers

I don't know of anything in the API that will parse multiple XML documents from a single stream. I think you're going to have to scan for the <?xml ... declarations yourself and split up the input. The parser won't know that it has reached the next XML document until it reads that declaration. At that point it will choke, and the start of the next XML document will already have been consumed.

Actually, now that you mention it, you may be able to use a pull parser to do what you want. But I'm pretty sure the SAX and DOM parsers in the API won't do it.
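As a rough illustration of the scan-and-split idea, here is a minimal sketch that cuts an in-memory string at each <?xml declaration and hands every piece to the JDK's DocumentBuilder. The class name XmlDeclSplitter and its methods are hypothetical, not part of any library. It assumes the marker never appears inside a comment, CDATA section, or text content, and it requires whitespace after <?xml so that processing instructions such as <?xml-stylesheet ...?> are not mistaken for a new document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: splits concatenated XML documents at each "<?xml " declaration.
class XmlDeclSplitter {
    private static final String MARKER = "<?xml";

    // Index of the next "<?xml" that is followed by whitespace, or -1 if none.
    static int nextDecl(String text, int from) {
        int i = text.indexOf(MARKER, from);
        while (i >= 0) {
            int after = i + MARKER.length();
            if (after < text.length() && Character.isWhitespace(text.charAt(after))) {
                return i;
            }
            i = text.indexOf(MARKER, after);
        }
        return -1;
    }

    // Returns one string per document; anything before the first declaration is dropped.
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int start = nextDecl(text, 0);
        while (start >= 0) {
            int next = nextDecl(text, start + MARKER.length());
            docs.add(text.substring(start, next >= 0 ? next : text.length()));
            start = next;
        }
        return docs;
    }

    // Parses a single, already separated document with the JDK's DOM parser.
    // Assumes the text is UTF-8 compatible (no encoding attribute to the contrary).
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String input = "<?xml version=\"1.0\"?><Node><Node>a</Node></Node>"
                     + "<?xml version=\"1.0\"?><Node><Node>b</Node></Node>";
        for (String doc : split(input)) {
            System.out.println(parse(doc).getDocumentElement().getTextContent());
        }
    }
}
```

Splitting before parsing keeps each parser invocation on a well-formed document, so none of the parser's internal read-ahead can swallow the start of the next one.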

Ted Hopp
  • The parser should be able to detect the end of the current XML, right? Why would it read more than is necessary, i.e., more than the current XML? – foobarometer Jun 12 '11 at 06:28
  • The parser is supposed to check for document well-formedness. One rule is that the document has a single root tag. The parser will continue reading until it comes to the end, or until it comes to a second root-level tag and throws an exception. At that point, it will have read the second <?xml declaration. – Ted Hopp Jun 12 '11 at 06:31
  • Thanks Ted, I agree with you. This would violate the well-formedness rule and the parser would need to verify that. I'll leave the question open for a while in case someone has any insights, thanks! – foobarometer Jun 12 '11 at 06:38
  • I'm reconsidering my answer. You may be able to use StAX. See the javax.xml.stream package. – Ted Hopp Jun 12 '11 at 06:54
  • @Ted Hopp: I don't think StAX will help. If I remember correctly, it too needs the XML document to be well-formed. As with SAX, though, the violation is only detected after the first part of the input has been parsed successfully, that is, when the first 'error' is encountered. – Yaneeve Jun 12 '11 at 08:11
  • Probably so. There's also the problem that any of the parsers may do some internal buffering, so even if you could arrange for them to stop after the closing tag, the input you need for the next document may already be gone. – Ted Hopp Jun 12 '11 at 13:38
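The well-formedness point in these comments can be checked with a small experiment. The sketch below feeds a single document, and then two concatenated documents, to the JDK's DOM parser and reports whether each input is accepted. The class name ConcatenatedParseDemo is hypothetical, and the exact error message depends on the parser implementation, so the code only tests for a SAXException.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

// Hypothetical demo: does the DOM parser accept the given text as ONE document?
class ConcatenatedParseDemo {
    static boolean parsesAsOneDocument(String text) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violation, e.g. a second "<?xml" after the root element.
            return false;
        } catch (Exception e) {
            // I/O or parser configuration problems are not expected with in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String one = "<?xml version=\"1.0\"?><Node><Node/></Node>";
        System.out.println(parsesAsOneDocument(one));
        System.out.println(parsesAsOneDocument(one + one));
    }
}
```

With the parser bundled in the JDK, the first call should succeed and the second should be rejected, which is why the input has to be separated before it reaches the parser.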

I had to do something like this, and I answered my own question here with a Reader subclass that wraps everything for simpler use.

Filipe Pina

It is common to check for the <? sequence at the start of an XML file, because an XML declaration, when present, must be the very first thing in the document (a BOM is not to be expected in the middle of the file). So I would look at the encoding and split the file, as already suggested, at every occurrence of <? followed by "xml".

Clemens
  • Actually, reading the whole file may not be an option for me, so I would probably write a parser that reads a few bytes at a time using the file channel. Thanks! – foobarometer Jun 12 '11 at 08:50
  • Sure, just to split the file you don't need to read the whole file at once. – Clemens Jun 12 '11 at 10:12
  • Still, it is some work. What would one do if these arrived over a stream, say from the network? Anyway, thanks! – foobarometer Jun 12 '11 at 10:25
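For the chunk-at-a-time case raised in these comments, one possible sketch reads from any ReadableByteChannel into a ByteBuffer and emits each document's bytes as soon as the next <?xml marker shows up, so the whole input is never held in memory at once. The names ChannelXmlSplitter, split, and splitAll are hypothetical. The sketch assumes an ASCII-compatible encoding such as UTF-8, a blocking channel, input that starts with a declaration, and no <?xml-prefixed processing instructions inside a document. It favors clarity over speed, since it copies the pending bytes on every read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: splits a channel's bytes into documents at each "<?xml".
class ChannelXmlSplitter {
    private static final byte[] MARKER = "<?xml".getBytes(StandardCharsets.US_ASCII);

    // Index of MARKER in data at or after 'from', or -1 if it does not occur.
    static int indexOfMarker(byte[] data, int from) {
        outer:
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Reads chunkSize bytes at a time and passes each complete document to sink.
    static void split(ReadableByteChannel in, int chunkSize, Consumer<byte[]> sink)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(chunkSize);
        byte[] pending = new byte[0]; // bytes of the document currently being collected
        int scanFrom = 1;             // skip index 0: the current document's own declaration
        while (in.read(buffer) != -1) {
            buffer.flip();
            int oldLen = pending.length;
            pending = Arrays.copyOf(pending, oldLen + buffer.remaining());
            buffer.get(pending, oldLen, buffer.remaining());
            buffer.clear();
            int hit;
            while ((hit = indexOfMarker(pending, scanFrom)) > 0) {
                sink.accept(Arrays.copyOfRange(pending, 0, hit));
                pending = Arrays.copyOfRange(pending, hit, pending.length);
                scanFrom = 1;
            }
            // Resume where a marker could still straddle the next chunk boundary.
            scanFrom = Math.max(1, pending.length - MARKER.length + 1);
        }
        if (pending.length > 0) sink.accept(pending); // last document ends at end of input
    }

    // Convenience wrapper for in-memory data, used by the demo below.
    static List<String> splitAll(byte[] data, int chunkSize) {
        List<String> docs = new ArrayList<>();
        try {
            split(Channels.newChannel(new ByteArrayInputStream(data)), chunkSize,
                  d -> docs.add(new String(d, StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) {
        byte[] data = ("<?xml version=\"1.0\"?><Node>a</Node>"
                     + "<?xml version=\"1.0\"?><Node>b</Node>").getBytes(StandardCharsets.UTF_8);
        splitAll(data, 7).forEach(System.out::println);
    }
}
```

Because split only depends on ReadableByteChannel, the same method could be handed a FileChannel or a blocking SocketChannel, and each emitted byte array can then be passed to an ordinary parser.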