I am using XMLStreamReader to split an XML file. The approach looks promising, but it does not yet give the desired result. My aim is to extract every "nextTag" node from an input file like this:

<?xml version="1.0" encoding="UTF-8"?>
<firstTag>
    <nextTag>1</nextTag>
    <nextTag>2</nextTag>
</firstTag>

The outcome should look like this:

<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>

Following "Split 1GB Xml file using Java", I achieved this with the following code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo4 {

    public static void main(String[] args) throws Exception {

        InputStream inputStream = new FileInputStream("input.xml");
        BufferedReader in = new BufferedReader(new InputStreamReader(inputStream));

        XMLInputFactory factory = XMLInputFactory.newInstance();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();

        XMLStreamReader streamReader = factory.createXMLStreamReader(in);

        while (streamReader.hasNext()) {
            streamReader.next();

            if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT
                    && "nextTag".equals(streamReader.getLocalName())) {

                StringWriter writer = new StringWriter();
                t.transform(new StAXSource(streamReader), new StreamResult(
                        writer));
                String output = writer.toString();
                System.out.println(output);

            }

        }

    }

}

This is actually very simple. However, my real input file consists of a single line:

<?xml version="1.0" encoding="UTF-8"?><firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>

With this input, my Java code no longer produces the desired output. It prints only this:

 <?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>

After spending hours on this, I am fairly sure I have found the cause:

 t.transform(new StAXSource(streamReader), new StreamResult(writer));

The reason is that once the transform method has run, the cursor has already moved forward to the next event. And my code has this structure:

while (streamReader.hasNext()) {
    streamReader.next();
                      ...
        t.transform(new StAXSource(streamReader), new StreamResult(writer));
                      ...
}

After the first transform, the streamReader is therefore advanced twice in a row:

 1. from the transform method
 2. from the next method in the while loop

So with the single-line XML, the cursor never stops on the second <nextTag> start tag. In contrast, if the input XML is pretty-printed, the cursor does reach the second start tag, because there is a whitespace event after the first closing tag.
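To make the difference visible, here is a minimal sketch that lists the events the cursor visits for the compact and for the pretty-printed input. The class name EventDump and the helper events are made up for illustration, and coalescing is switched on only so that each text run is reported as a single event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class EventDump {

    // Hypothetical helper: returns a readable name for every event after START_DOCUMENT.
    static List<String> events(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption for this demo only: merge adjacent text into one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        List<String> names = new ArrayList<>();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    names.add("START " + reader.getLocalName());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    names.add("END " + reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.SPACE:
                    names.add(reader.isWhiteSpace() ? "WHITESPACE" : "TEXT");
                    break;
                case XMLStreamConstants.END_DOCUMENT:
                    names.add("END_DOCUMENT");
                    break;
                default:
                    names.add("OTHER");
            }
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        String compact = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        String pretty = "<firstTag>\n    <nextTag>1</nextTag>\n    <nextTag>2</nextTag>\n</firstTag>";
        System.out.println(events(compact));
        System.out.println(events(pretty));
    }
}
```

In the compact input the event directly after the first END nextTag is the second START nextTag, which is exactly the event that the extra next() call in the loop skips. In the pretty-printed input a whitespace event sits in between and absorbs that extra call.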

Unfortunately, I could not find any setting that stops the transformer from automatically advancing to the next event after the transform method has run. This is so frustrating.

Does anybody have an idea how I can deal with this? Suggestions for a different approach are also very welcome. Thank you so much.

Regards,

Ratna

PS. I could of course write a workaround for this problem (pretty-print the XML document before transforming it), but that would mean modifying the input XML beforehand, which is not allowed.

  • Can you try getting rid of the `BufferedReader` and `InputStreamReaders`. They will mangle the encoding in some cases, and might be messing things up with newlines. – artbristol Jun 22 '14 at 15:33
  • Hello artbristol, I just tested it, but it does not change anything. :-( – user3764388 Jun 22 '14 at 20:43

2 Answers

As you worked out, the transformation step leaves the cursor on the event after the element's end tag. When the element nodes follow each other directly, that event is already the next start element.

To deal with this, you can rewrite your code using nested while loops, like this:

        while(reader.next() != XMLStreamConstants.END_DOCUMENT) {
            while(reader.getEventType() == XMLStreamConstants.START_ELEMENT && reader.getLocalName().equals("nextTag")) {
                StringWriter writer = new StringWriter();
                // transforms the current node to a String and leaves the cursor on the event after its end tag
                t.transform(new StAXSource(reader), new StreamResult(writer)); 
                System.out.println(writer.toString());
            }
        }
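
For completeness, here is a self-contained sketch of the same nested-loop idea that can be run against both the single-line and the pretty-printed input. The class name SplitDemo and the method split are only illustrative, and the sketch assumes the tag being extracted is not the root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

final class SplitDemo {

    // Hypothetical helper: returns one serialized document per matching element.
    // Assumes tagName is not the root element of the input.
    static List<String> split(String xml, String tagName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        List<String> fragments = new ArrayList<>();

        while (reader.next() != XMLStreamConstants.END_DOCUMENT) {
            // The inner loop re-checks the event the transformer left the cursor on,
            // so a directly following start tag is not skipped by the outer next().
            while (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && tagName.equals(reader.getLocalName())) {
                StringWriter writer = new StringWriter();
                t.transform(new StAXSource(reader), new StreamResult(writer));
                fragments.add(writer.toString());
            }
        }
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String compact = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        String pretty = "<firstTag>\n    <nextTag>1</nextTag>\n    <nextTag>2</nextTag>\n</firstTag>";
        split(compact, "nextTag").forEach(System.out::println);
        split(pretty, "nextTag").forEach(System.out::println);
    }
}
```

Collecting the fragments in a list is just a convenience for the demo. For a large file you would more likely write each fragment out as soon as it has been produced.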
Dag
  • Can't upvote this enough! After hours of trying other solutions, this is the first solution that can handle XML with and without whitespace between the tags. Thank you so much! – Christian Ciach Nov 30 '17 at 11:57

If your XML file fits in memory, you can try the jOOX library, added as a Gradle dependency like this:

compile 'org.jooq:joox:1.3.0'

And the main class looks like this:

import java.io.File;
import java.io.IOException;
import org.joox.JOOX;
import org.joox.Match;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import static org.joox.JOOX.$;

public class Main {

    public static void main(String[] args) 
            throws IOException, SAXException, TransformerException {
        DocumentBuilder builder = JOOX.builder();
        Document document = builder.parse(new File(args[0]));

        Transformer transformer = 
                TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty("omit-xml-declaration", "no");

        final Match $m = $(document);
        $m.find("nextTag").forEach(tag -> {
            try {
                transformer.transform(
                        new DOMSource(tag), 
                        new StreamResult(System.out));
                System.out.println();
            }
            catch (TransformerException e) {
                // report the failure instead of exiting silently
                e.printStackTrace();
                System.exit(1);
            }
        });

    }
}

It yields:

<?xml version="1.0" encoding="UTF-8"?><nextTag>1</nextTag>
<?xml version="1.0" encoding="UTF-8"?><nextTag>2</nextTag>
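
If adding a dependency is not an option, the same in-memory idea can probably be expressed with the JDK's own DOM API. The following is only a sketch under that assumption and is not part of the jOOX solution above. The names DomSplit and split are illustrative, and getElementsByTagName also returns nested elements with the same name, which does not matter for the input in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomSplit {

    // Hypothetical helper: parses the whole document into memory and
    // serializes every element named tagName as its own document.
    static List<String> split(String xml, String tagName) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Transformer transformer = TransformerFactory.newInstance().newTransformer();

        NodeList nodes = document.getElementsByTagName(tagName);
        List<String> fragments = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(nodes.item(i)), new StreamResult(writer));
            fragments.add(writer.toString());
        }
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<firstTag><nextTag>1</nextTag><nextTag>2</nextTag></firstTag>";
        split(xml, "nextTag").forEach(System.out::println);
    }
}
```

Because the document is fully parsed before anything is written, whitespace between the tags makes no difference here, at the cost of holding the whole file in memory.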
Birei