I am running into out-of-memory exceptions when reading in very large XML strings and converting them into a Document object.

The way I am doing this is I am opening a URL stream to the XML file, wrapping that in an InputStreamReader, then wrapping that in a BufferedReader.

Then I read from the BufferedReader and append to a StringBuffer:

StringBuffer doc = new StringBuffer();
BufferedReader in = new BufferedReader(new InputStreamReader(downloadURL.openStream()));
String inputLine;
// readLine() strips line terminators, so the lines are appended without them
while ((inputLine = in.readLine()) != null) {
  doc.append(inputLine);
}

Now this is the part I am having an issue with. I call toString() on the StringBuffer and then getBytes() to get a byte array, which is then used to create a ByteArrayInputStream. I believe this step causes the same data to be held in memory twice. Is that right?

Here is what I am doing:

byte xmlBytes[] = doc.toString().getBytes();
ByteArrayInputStream is = new ByteArrayInputStream(xmlBytes);
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
Builder xmlBuilder = new Builder(xmlReader,false);
Document d = xmlBuilder.build(is);

Is there a way I can avoid creating duplicate memory (if I am doing so in the first place) or is there a way to convert the BufferedReader straight into a ByteArrayInputStream?

Thanks

Seephor
  • You're not creating "duplicate memory". You hold everything in your memory. Where do you want this StringBuffer to go in the end? – Alexey Soshin Nov 09 '17 at 20:33
  • What I meant was, am I creating the same data in memory twice? I ultimately just need the data in the Document object. – Seephor Nov 09 '17 at 20:40

2 Answers

Here is how you can consume an InputStream to create a Document using a DOM parser:

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document document = builder.parse(inputStream);

This creates fewer intermediate copies, because the parser reads from the stream directly. However, if the XML document is very large, the best solution is to use a StAX parser instead of parsing the whole document into memory.
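As a minimal sketch of parsing straight from a stream, the example below skips the StringBuffer, toString() and getBytes() steps entirely. The class name DomFromStream and the sample XML are made up for illustration; in the question's code the stream would come from downloadURL.openStream(). If the Builder in the question is XOM's nu.xom.Builder, its build(InputStream) method should accept that stream directly in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original answer.
class DomFromStream {

    /** Parses XML directly from the stream; no intermediate String or byte[] is built here. */
    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(in);
    }

    public static void main(String[] args) throws Exception {
        // An in-memory stream stands in for downloadURL.openStream()
        // so the sketch runs without network access.
        byte[] sample = "<root><item>a</item><item>b</item></root>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            Document doc = parse(in);
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }
}
```

The resulting Document still lives fully in memory, so this only removes the extra copies made before parsing.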

With a StAX parser, you don't load the entire parsed document in memory. Instead, you handle each element found sequentially (and the element is thrown away immediately).
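A rough sketch of what this looks like with the JDK's built-in StAX API (javax.xml.stream) follows. It only counts elements with a given name, which is an invented task for illustration; the class name StaxCount and the sample XML are likewise assumptions. The point is that each event is inspected and then discarded, so memory use does not grow with the size of the document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original answer.
class StaxCount {

    /** Counts elements with the given local name, pulling one parse event at a time. */
    static int countElements(InputStream in, String localName) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the following event; nothing earlier is retained
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            // Frees parser resources; the underlying stream is closed by the caller
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "<root><item>a</item><item>b</item></root>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(countElements(in, "item"));
        }
    }
}
```

Because the caller drives the loop, it can also stop early and close the stream as soon as it has found what it needs.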

Here is a good explanation: Java: Parsing XML files: DOM, SAX or StAX?

There are also SAX parsers, but it's much easier to use StAX. Discussion here: When should I choose SAX over StAX?
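For comparison, here is a sketch of the same kind of element count written against SAX, assuming the JDK's default SAXParserFactory (which is not namespace-aware unless configured, hence the use of qName). The class name SaxCount and the sample XML are made up for illustration. With SAX the parser calls back into a handler, so the caller does not control the loop the way it does with StAX.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer.
class SaxCount {

    /** Counts elements whose qualified name matches, using SAX callbacks. */
    static int countElements(InputStream in, String name) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int[] count = {0}; // array so the anonymous handler can update it
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (name.equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "<root><item>a</item><item>b</item></root>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(countElements(in, "item"));
        }
    }
}
```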

Duarte Meneses
  • But how do I create the correct ByteArrayInputStream without calling toString on the StringBuffer? What you suggested looks the same as what I have already in the second section. Or are you saying I can use the original InputStreamReader that is wrapped in the BufferedReader instead of using a StringBuffer? It's unclear. – Seephor Nov 16 '17 at 21:08

If your XML (or JSON) file is large, it is not a good idea to load the whole content into memory because, as you mentioned, parsing it consumes a huge amount of memory.

This issue becomes more serious with multiple users (that is, more than one thread). Just imagine what will happen if your application needs to serve two, ten, or more parallel requests...

The best way to process a huge file is as a stream: once you have read the data you need, you can close the stream without reading it to the end. This is a faster and more memory-friendly solution.

Apache Commons IO can help you to do the job:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}

Another way to handle this issue is to split your XML file into parts, so that you can process the smaller parts without any issue.

zappee
  • It is already reading line by line for the parsing part; it is only when converting to a byte array output stream that it creates a new string with the same contents as the buffer. Unless you mean going line by line when converting the bytes to XML? In that case, how would you do that with XMLReader? – Seephor Nov 16 '17 at 19:33