5

I am looking for a modern, memory-efficient, high-performance Java XML parsing API. I need to parse XML files of 3 MB to 5 MB.

I searched for this and learned that the Sun Java Streaming XML Parser (SJSXP) and Woodstox are reported to be much faster than DOM and SAX. Both implement the StAX API. (As far as I can tell, schema validation is not supported by these technologies.)

The Aalto XML processor also implements the StAX API.

I have not found concrete performance data for these technologies.

Which one is best in terms of memory efficiency, performance, and ease of use?

Santosh
  • 4
    If you haven't been able to find any useful performance data for them, you could always test them yourself to see which one performs best for your particular situation. – Anthony Grist Aug 02 '12 at 08:29
  • 1
    The performance is likely to be dependent on your particular situation. When performance is critical, I have written parsers designed for the exact use case, and this improved performance by a factor of 10 (but you have to build in a lot of assumptions, which impacts robustness and flexibility). – Peter Lawrey Aug 02 '12 at 08:43
  • I was looking for industry-standard benchmarking data for the above-mentioned technologies. – Santosh Aug 02 '12 at 09:37
  • 1
    @Santosh mostly there haven't been benchmarks for two reasons: (a) there has been little if any development on Java/XML parsing for the past 3 years or so, and (b) XML is losing its significance for data interchange (as opposed to textual markup), so fewer developers really care. – StaxMan Aug 02 '12 at 17:33
  • StaxMan (despite having written some superb XML parsers) is completely wrong about (b). The financial services industry is exchanging billions of messages in formats like FpML and isn't going to change any time soon, and their performance is absolutely critical. What's different is that these guys know that adding 10% to their hardware capability is much cheaper than looking for a 10% faster parser; also, XML parsing is typically only 5% of the overall message processing path length. – Michael Kay Aug 03 '12 at 08:43
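
Following the suggestion in the comments to test the parsers yourself, here is a rough timing sketch. It is only an illustration under some assumptions: the class and method names (`StaxTimingSketch`, `buildDocument`, `parseOnce`, `averageMillis`) are made up for this example, the document is synthetic rather than one of your real files, and a plain `System.nanoTime()` loop is a crude measurement compared to a proper benchmark harness. It only uses the standard `javax.xml.stream` API, so it should run against whichever StAX implementation is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxTimingSketch {

    // Builds a synthetic document with the given number of <item> elements (hypothetical test data).
    static byte[] buildDocument(int items) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">value ").append(i).append("</item>");
        }
        return sb.append("</items>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // Pulls every event from the document once and returns how many events were read.
    static int parseOnce(XMLInputFactory factory, byte[] doc) {
        int events = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(doc));
            try {
                while (reader.hasNext()) {
                    reader.next();
                    events++;
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped as unchecked only to keep the sketch short.
            throw new IllegalStateException("Parsing failed", e);
        }
        return events;
    }

    // Runs some warm-up parses (so the JIT can kick in), then returns the average time per parse in ms.
    static double averageMillis(XMLInputFactory factory, byte[] doc, int warmup, int runs) {
        long sink = 0; // accumulated so the parse results are actually used
        for (int i = 0; i < warmup; i++) {
            sink += parseOnce(factory, doc);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += parseOnce(factory, doc);
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            throw new IllegalStateException("unreachable; keeps 'sink' live");
        }
        return elapsed / 1e6 / runs;
    }

    public static void main(String[] args) {
        // 100,000 small items gives a document of a few MB, similar to the sizes in the question.
        byte[] doc = buildDocument(100_000);
        XMLInputFactory factory = XMLInputFactory.newInstance();
        System.out.println("Implementation: " + factory.getClass().getName());
        System.out.println("Document size (bytes): " + doc.length);
        System.out.println("Average ms per parse: " + averageMillis(factory, doc, 20, 50));
    }
}
```

Running the same class with a different StAX implementation's jar on the classpath gives a first, rough comparison. For numbers you intend to rely on, use your real documents and a dedicated benchmarking harness.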

2 Answers

3

Here are some more links that might be relevant:

As to performance: SJSXP is the slowest; it is just the repackaged internals of Xerces, wrapped in the StAX API. This has some negative effects on performance (since Xerces is not really designed for pull parsing). Woodstox is a bit faster: much faster for small documents and for writing, with less of a difference when parsing longer documents.

And Aalto is by far the fastest of the three, especially for parsing. It is commonly 50% - 100% faster than either Woodstox or SJSXP. One downside is that it does not handle DTDs (and therefore not external entities; it does handle pre-defined and character entities).

Disclaimer: I am the author of Woodstox and Aalto, as well as a contributor to SJSXP (bug fixes).
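
Since all three parsers implement the standard StAX API, the calling code looks the same whichever one you pick. The following is a minimal sketch of cursor-style pull parsing, not code taken from any of the projects; the class name `StaxCountDemo`, the method `countElements`, and the sample XML are invented for illustration. Which implementation `XMLInputFactory.newInstance()` returns depends on what is on the classpath (by default the one bundled with the JDK).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxCountDemo {

    // Counts start tags with the given local name, reading one event at a time
    // so the whole document never has to be held in memory as a tree.
    static int countElements(String xml, String localName) {
        // Standard factory lookup; no implementation-specific classes are referenced here.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        int count = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped as unchecked only to keep the sketch short.
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/><note>hi</note></orders>";
        System.out.println(countElements(xml, "order")); // prints 2
    }
}
```

For a real 3 MB to 5 MB file you would pass an `InputStream` to `createXMLStreamReader` instead of a `StringReader`, so the parser can detect the encoding itself.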

StaxMan
  • I'm looking for sample code for the Aalto XML processor for reference. Any help? – Santosh Aug 23 '12 at 15:35
  • 1
    It implements the StAX API, so there shouldn't be anything special there... just use the Woodstox sample code? Or, since it also implements SAX (as does Woodstox 3.2 and above), you can use the SAX API as well. – StaxMan Aug 23 '12 at 18:58
  • Is Aalto still in development? I notice the last release was over 2 years ago. – pacoverflow Nov 21 '13 at 15:56
  • There isn't much ongoing work due to a lack of feedback and requests; it works to the degree users want. This is probably more due to JSON becoming the dominant data format for new development. But the project is not dead, and bug fixes for reported problems should be handled quickly. – StaxMan Nov 21 '13 at 23:09
0

Some helpful links for the above queries:

http://www.developerfusion.com/article/84523/stax-the-odds-with-woodstox/ (June 2010)

http://www.ibm.com/developerworks/opensource/library/os-ag-renegade15/ (July 2007)

Performance benchmarking details:

http://www.xml.com/pub/a/2007/05/09/xml-parser-benchmarks-part-1.html (May 2007)

bdoughan
Santosh
  • 1
    The links in your answer are 2-5 years old and probably too old to accurately represent the current performance of the libraries discussed. – bdoughan Aug 02 '12 at 10:48
  • 1
    SJSXP has not really changed at all for the past 3 years or so, only some bug fixes. And AFAIK, no performance changes. This is the biggest reason for the lack of new benchmarks, as very little has changed. – StaxMan Aug 02 '12 at 17:28