XML splitting of BIG file using Java

Question

I'm trying to create a Java program that will split a selected XML file.

XML file data sample:

<EmployeeDetails>
<Employee>
<FirstName>Ben</FirstName>
</Employee>
<Employee>
<FirstName>George</FirstName>
</Employee>
<Employee>
<FirstName>Cling</FirstName>
</Employee>
</EmployeeDetails>

And so on. I have a 250 MB XML file, and it is always a pain to open it in an external program and split it manually so that others can read it (not every laptop or desktop can open such a large file). So I decided to create a Java program with these functions:

- Select the XML file (already done)
- Split the file based on the number of tags. For example, the current file has 100k Employee tags, and I'll ask the user how many Employee elements he/she wants per split file (e.g. 10k per file)
- Split the file (already done)

I just want to ask for help with the second task. I have spent 3-4 days looking into how to do this, or whether it is even feasible (in my mind, of course it is).

Any response will be appreciated.

Cheers, Grimm.

For Java you have two choices - a DOM (document object model) where the whole file is read into memory. That will be a bit simpler to implement but will require a reasonable amount of memory - a 1GB JVM should be sufficient if the program isn't doing much else. A SAX (streaming) model could handle the file even if it became 100GB - it reads the file a bit at a time and has callbacks when, for example, it sees a new tag. [This post](http://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom) goes into a bit more detail. — stdunbar, Jun 19 '16 at 21:31
Thanks for the response, @stdunbar, and for the nice idea. I think I'll try the SAX approach, since DOM would be a burden on low-end computers. — Grimmjow, Jun 22 '16 at 07:22
@MichaelKay, I already tried to create an XSLT transformation that splits the file. I used Oxygen, but surprisingly it needs a well-specced computer as well. Thanks a lot for your input; the XSLT you provided below is amazing. — Grimmjow, Jun 22 '16 at 07:24
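
To make the SAX callback model from the comment above concrete, here is a minimal sketch that counts Employee elements without building a tree, then works out how many output files a given chunk size needs. The class name EmployeeCounter and the helper partsNeeded are hypothetical names chosen for illustration; only the JDK's built-in SAX parser is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original thread.
class EmployeeCounter {

    /** Counts elements with the given name; memory use does not grow with file size. */
    static long count(InputStream xml, String elementName) throws Exception {
        long[] total = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once for every opening tag the parser sees.
                if (qName.equals(elementName)) {
                    total[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return total[0];
    }

    /** Number of output files needed for total records at perFile records each (rounded up). */
    static long partsNeeded(long total, long perFile) {
        return (total + perFile - 1) / perFile;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<EmployeeDetails>"
                + "<Employee><FirstName>Ben</FirstName></Employee>"
                + "<Employee><FirstName>George</FirstName></Employee>"
                + "<Employee><FirstName>Cling</FirstName></Employee>"
                + "</EmployeeDetails>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        long total = count(in, "Employee");
        System.out.println(total);                 // prints 3
        System.out.println(partsNeeded(total, 2)); // prints 2
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a stream opened on the selected XML file.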

score 2 · Accepted Answer · answered Jun 20 '16 at 10:14

Assuming a flat structure where the root element of the document R has a large number of children named X, the following XSLT 2.0 transformation will split the file every Nth X element.

<t:transform xmlns:t="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <t:param name="N" select="100"/>
  <t:template match="/*">
    <t:for-each-group select="X" 
                      group-adjacent="(position()-1) idiv $N">
      <t:result-document href="{position()}.xml">
        <R>
          <t:copy-of select="current-group()"/>
        </R>
      </t:result-document>
   </t:for-each-group>
  </t:template>
</t:transform>

If you want to run this in streaming mode (without building the source tree in memory), then (a) add <t:mode streamable="yes"/> as a top-level declaration (the stylesheet above binds the XSLT namespace to the prefix t rather than xsl), and (b) run it using an XSLT 3.0 processor (Saxon-EE or Exselt).

This works as expected as well. Now I have two options, your answer and the one below, but apparently accepting two answers is not possible. — Grimmjow, Jun 22 '16 at 13:56

score 0 · Answer 2 · answered Jun 20 '16 at 10:43

A simple solution is in order. If the XML always has those line breaks as shown, XML processing is not needed.

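import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path originalPath = Paths.get("... .xml");
try (BufferedReader in = Files.newBufferedReader(originalPath, StandardCharsets.UTF_8)) {
    String line = in.readLine(); // Skip header line(s), here the opening <EmployeeDetails>

    line = in.readLine();
    // Each pass of this loop writes one part file; fileno numbers the parts from 1.
    for (int fileno = 1; line != null && !line.contains("</EmployeeDetails>"); ++fileno) {
        Path partPath = Paths.get("...-" + fileno + ".xml");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(partPath,
                StandardCharsets.UTF_8))) {
            int counter = 0; // Employee elements written to this part so far
            out.println("<EmployeeDetails>"); // Write header.
            do {
                out.println(line);
                if (line.contains("</Employee>")) {
                    ++counter;
                }
                line = in.readLine();
            } while (line != null && !line.contains("</EmployeeDetails>")
                    && counter < 1000); // 1000 Employee elements per part
            out.println("</EmployeeDetails>"); // Close the root element of this part.
        }
    }
}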

Joop Eggen

107,315
7
83
138

This is quite good and it works, but as with the issue I mentioned in the comments above, not every computer can open a heavy XML file. I expect the error I have run into before: "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space" – Grimmjow Jun 22 '16 at 13:28
That is weird. Try using just a BufferedWriter, as PrintWriter unfortunately sweeps exceptions under the carpet. Do you do anything else with the line that is read? – Joop Eggen Jun 22 '16 at 13:48
Another idea: use gz compression of the xml to `xxx.xml.gz` and use `new InputStreamReader(new GZIPInputStream(...`. – Joop Eggen Jun 22 '16 at 13:54
None. As I said, this is working, and the heap space error can be corrected by adjusting the run configuration on my end. The problem I'm trying to solve is in the program implementation: this code can handle small files, but I'm sure it cannot accommodate a gigabyte XML file. Still, thanks a lot for your help, very much appreciated. – Grimmjow Jun 22 '16 at 13:55
Maybe at some point, newlines are missing and a MB line is read. Good luck. – Joop Eggen Jun 22 '16 at 14:04
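
If missing newlines are the cause, one way to stop depending on line breaks is to let a streaming XML parser find the element boundaries. The sketch below swaps in StAX (javax.xml.stream, shipped with the JDK) in place of line-based reading. It assumes the flat structure from the question, where every direct child of the root element is one record. The class name StaxSplitter, the method split, and the part-N.xml file names are hypothetical choices for illustration.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original thread.
class StaxSplitter {

    /** Writes part-1.xml, part-2.xml, ... into outDir and returns the number of parts. */
    static int split(Path source, Path outDir, int perFile) throws Exception {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        int fileNo = 0;
        OutputStream out = null;
        try (InputStream in = Files.newInputStream(source)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            StartElement root = null;
            XMLEventWriter writer = null;
            int depth = 0;  // 1 = root element, 2 = record, 3+ = inside a record
            int inPart = 0; // records completed in the current part
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 1) {
                        root = event.asStartElement(); // remembered, repeated in every part
                        continue;
                    }
                    if (writer == null) { // first record of a new part
                        out = Files.newOutputStream(outDir.resolve("part-" + (++fileNo) + ".xml"));
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                        inPart = 0;
                    }
                }
                if (depth >= 2 && writer != null) {
                    writer.add(event); // copy everything that belongs to a record
                }
                if (event.isEndElement()) {
                    if (depth == 2) {
                        inPart++; // a record just closed
                    }
                    depth--;
                    boolean partFull = depth == 1 && inPart == perFile;
                    if (writer != null && (partFull || depth == 0)) {
                        QName name = root.getName();
                        writer.add(events.createEndElement(
                                name.getPrefix(), name.getNamespaceURI(), name.getLocalPart()));
                        writer.add(events.createEndDocument());
                        writer.flush();
                        writer.close();
                        out.close();
                        writer = null;
                        out = null;
                    }
                }
            }
        } finally {
            if (out != null) {
                out.close(); // only reached with an open part if parsing failed midway
            }
        }
        return fileNo;
    }

    public static void main(String[] args) throws Exception {
        // Demo on a one-line document: 3 records at 2 per file gives 2 parts.
        Path dir = Files.createTempDirectory("split-demo");
        Path source = dir.resolve("employees.xml");
        String xml = "<EmployeeDetails><Employee><FirstName>Ben</FirstName></Employee>"
                + "<Employee><FirstName>George</FirstName></Employee>"
                + "<Employee><FirstName>Cling</FirstName></Employee></EmployeeDetails>";
        Files.write(source, xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(split(source, dir, 2) + " parts written to " + dir);
    }
}
```

Only one record's worth of events is held at a time, so heap use should stay roughly constant regardless of input size. Whitespace between records is dropped, so the parts are not pretty-printed.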
