
I am new to Java. I have a 2 GB XML file which I need to parse and store its data in a database.

Someone on StackOverflow recommended Dom4j for large XML files. Parsing works fine, but the Document returned by Dom4j is very large, and iterating over it loads all of the DOM objects into memory (the heap).

This results in out-of-memory errors. Can somebody please tell me how to avoid them? Is there a mechanism in Java for on-demand heap allocation and deallocation?


2 Answers


You have two choices:

  1. reconfigure your JVM to allow a larger maximum heap (via -Xmx2g or similar). See here for more info. This option is obviously limited by your OS and the amount of free memory on your system.
  2. use a streaming API (such as SAX) that doesn't load all the XML into memory at once, but rather streams it through your process, allowing you to analyse it without holding the entire document in memory (see the sketch below)

The first option may help you immediately, and isn't specific to this question. The second option is the more scalable solution since it'll allow you to analyse documents of any size. Of course you need to worry about the memory consumption of the results of your analysis, but that's another matter entirely.
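
For illustration, here's a minimal SAX sketch. The element names (record, name) and the file name huge.xml are placeholders for your actual document; only the text of the current element is held on the heap at any time:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class BigXmlHandler extends DefaultHandler {

        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            text.setLength(0); // reset the buffer for each new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                // write the value to the database here (e.g. through a
                // batched PreparedStatement) instead of keeping it around
                System.out.println("name = " + text);
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("huge.xml"), new BigXmlHandler());
        }
    }

The parser pushes events to you as it reads, so memory use stays roughly constant no matter how big the file is; whatever you accumulate (batches of database rows, say) is entirely under your control.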

Brian Agnew
  • Thanks Brian, increasing the heap size is of course known to me, and processing XML in chunks is a good suggestion. But I need some generic solution for avoiding too much data getting loaded into the heap. There was a related problem with a large table too, with around 15000 records; there too, some said to use cursors. These solutions seem contextual: is there any generic solution or guideline for avoiding out-of-memory anomalies? Also, Dom4j has a SAX parser. – user2139064 Jun 10 '13 at 11:40

If you need to parse big XML files (and enlarging the Java heap does not always help), you need a SAX parser, which lets you parse the XML as a stream instead of loading the whole DOM tree into memory.

You may also check SAXDOMIX:

SAXDOMIX contains classes that can forward SAX events or DOM sub-trees to your application during the parsing of an XML document. The framework defines simple interfaces that allow the application to get DOM sub-trees in the middle of a SAX parsing. After handling, all DOM sub-trees become eligible for garbage collection. This solves the DOM scalability problem.
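
I can't vouch for SAXDOMIX's exact API from memory, but Dom4j itself offers the same SAX-plus-DOM-sub-tree pattern through its ElementHandler callback, which may suit you since you already use Dom4j. A rough sketch, where the /records/record path and the file name are assumptions you would adapt to your document:

    import java.io.File;
    import org.dom4j.Element;
    import org.dom4j.ElementHandler;
    import org.dom4j.ElementPath;
    import org.dom4j.io.SAXReader;

    public class Dom4jStreaming {
        public static void main(String[] args) throws Exception {
            SAXReader reader = new SAXReader();

            // fire a callback each time a </record> closes; the path is an
            // assumption -- adjust it to your document's structure
            reader.addHandler("/records/record", new ElementHandler() {
                public void onStart(ElementPath path) {
                    // nothing to do when the element opens
                }

                public void onEnd(ElementPath path) {
                    Element record = path.getCurrent(); // a complete DOM sub-tree
                    // read values from the sub-tree and store them in the database
                    record.detach(); // prune it so it can be garbage-collected
                }
            });

            reader.read(new File("huge.xml"));
        }
    }

The call to record.detach() is what keeps the heap flat: each finished sub-tree is pruned from the partially built document before the parser moves on.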

Juned Ahsan
  • Thanks Juned, I am using Dom4j and I think they also have a SAX parser, as one of the code snippets shows: SAXReader reader = new SAXReader(); – user2139064 Jun 10 '13 at 11:44
  • With DOM, the problem is that the entire XML tree needs to be loaded into memory. No matter how big a heap size you set, if your tree does not fit in it you will end up with an out-of-memory error. SAX is better for parsing big XML, as you can read in chunks. I like SAXDOMIX as it mixes SAX and DOM to let you parse in chunks and with ease. Try that. – Juned Ahsan Jun 10 '13 at 11:47
  • DOM (as output) is being used intentionally, as many of the XML nodes are interdependent, and pure SAX makes the processing really slow. Doesn't the SAX parser in Dom4j do the same job? – user2139064 Jun 10 '13 at 11:55