
I have a Java program (a WAR) that runs out of memory while manipulating a big XML file. The program is a REST API that returns the manipulated XML via a REST controller.

First, the program gets an XML file from a remote URL. Then it replaces the values of id attributes. Finally, it returns the new XML to the caller via the API controller.

What I get from the remote URL is a byte[] body with XML data. Then, I convert it to a String. Next, I do a regexp search-replace on the whole string. Then I convert it back to a byte[].

I'm guessing the XML is now in memory three times (the incoming bytes, the String, and the outgoing bytes).

I'm looking for ways to improve this.

I have no local copies on the filesystem btw.

codesmith
    Don't read the whole data into memory, use streams. Don't use regexp to process XML. Consider switching to stream approaches like SAX or StAX. – lexicore Mar 27 '18 at 08:49
  • @lexicore that sounds good. However, will that also work for I/O controllers? – codesmith Mar 27 '18 at 08:54
  • I don't know for sure, but I think it should. As long as you can read from a stream (the input file) and write to the response stream, this should work. – lexicore Mar 27 '18 at 08:57
  • I'd probably start by just "passing through" the whole XML file from input stream to the response stream. Next, I'd try to insert a dummy SAX filter in the middle so that input stream is parsed, goes through a filter and gets serialized to the output stream. Finally, tweak the filter to do the processing you want. If you only want to replace values of `id` attributes, I think, SAX will be sufficient as the processing is quite simple. – lexicore Mar 27 '18 at 09:02
  • 1
    The way I'd do it : 1) Do not get the remote file as a byte[], get it as a Stream (a HTTP connection is a Stream). 2) Do not convert it to a String, instanciate a Stax or SAX parser on it. 3) modify the id attributes on the fly, 4) write the result on your HttpServletResponse's outputsream, do not write to a byte[] or String either. A fully streamed implementation should only need a few kb of heap (and/or a few kbs above the biggest XML event). It's a lot more work, but it's a whole lot faster, accurate (XML-wise) & lighter. – GPI Mar 27 '18 at 11:52
  • Maybe XSLT + streaming: don't load the whole file into memory! https://stackoverflow.com/questions/4604497/xslt-processing-with-java – pdem Mar 27 '18 at 12:22
  • 1
    @pdem : there is no non-commercial streaming XSLT provider in Java. Only Saxon EE does this to my knowledge. (And generally speaking, XSLT is not a streamable process, only a subset of it is - but in this case, it is doable). See https://stackoverflow.com/questions/460895/what-is-the-most-efficient-java-based-streaming-xslt-processor Standard XSLT loads the XML tree in memory, one way or another. – GPI Mar 27 '18 at 13:37
  • @GPI oh yes, I agree, XSLT would load the stream into memory. There is a non-commercial solution included in the JDK in javax.xml.transform, but that's not the problem. Your comments could be an answer. – pdem Mar 28 '18 at 07:44
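
The streaming approach suggested in the comments above could be sketched with StAX roughly as follows. This is only a sketch, not tested against the asker's setup: the class name `IdRewriter`, the method name `rewriteIds`, and the `replacer` callback are hypothetical names, and only one XML event is held in memory at a time.

```java
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

public class IdRewriter {

    // Copies XML from in to out, replacing the value of every "id" attribute
    // via the given replacer. Events are processed one at a time, so heap
    // usage stays small regardless of document size.
    public static void rewriteIds(InputStream in, OutputStream out,
                                  UnaryOperator<String> replacer) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newFactory();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                List<Attribute> attrs = new ArrayList<>();
                // getAttributes() returns a raw Iterator on older JDKs, hence the cast
                for (Iterator<?> it = start.getAttributes(); it.hasNext(); ) {
                    Attribute a = (Attribute) it.next();
                    attrs.add("id".equals(a.getName().getLocalPart())
                            ? events.createAttribute(a.getName(), replacer.apply(a.getValue()))
                            : a);
                }
                event = events.createStartElement(start.getName(), attrs.iterator(), start.getNamespaces());
            }
            writer.add(event);
        }
        writer.close();
        reader.close();
    }
}
```

In a controller, `in` would be the stream of the HTTP connection to the remote URL and `out` the `HttpServletResponse`'s output stream, so the full document never needs to be materialized in memory.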

1 Answer


You can make the incoming bytes eligible for garbage collection by dropping the reference as soon as they have been converted to a String:

byte[] bytes = bytesFromURL;
String xml = new String(bytes);
bytes = null;  // the incoming bytes can now be collected
// ... manipulate xml ...
System.gc();   // only a hint; the JVM may ignore it
bytes = xml.getBytes();
Stempler