Being new to XML parsing, I'm trying to understand the different technologies. There is a confusing number of them, each aimed at different needs:

  • W3C-DOM
  • XOM
  • jDom
  • JAXP
  • JAXB
  • DOM
  • SAX
  • StAX
  • TrAX
  • Woodstox
  • dom4j
  • Crimson
  • VTD-XML
  • Xerces-J
  • Castor
  • XStream
  • ...

Just to name a few.

DOM and SAX seem to be a low-level way for parsing and working on XML, so I decided to focus on the ones that get mentioned the most in different sources and are low-level:

DOM, SAX, JAXP.

I've read about parsers in general here on stackoverflow, JAXP-Tutorial from Oracle, XML-Parsing in general, and so on.

I've also tried some tutorials, like this German one and others.

I'm grasping a little bit about DOM and SAX now, but the reason to use JAXP is still beyond me. It seems to be more of an interface to use DOM, SAX, ... internally, but why not use DOM or SAX directly?

What is the advantage of using JAXP, in layman's terms?

hamena314
  • When I'm working with (manipulating/creating) XML I always use DOM, but that's just my personal opinion! I think it works quite well and provides all the features you need. – ParkerHalo Jan 05 '16 at 09:58
  • This may help you: https://jaxp.java.net/1.4/JAXP-FAQ.html – Jean-Baptiste Yunès Jan 05 '16 at 10:10
  • ParkerHalo: DOM seems to be a very intuitive way to work with XML. The main reason to not use DOM is often stated as the size of a document, but people only say "if the document is too big, use SAX instead of DOM", while never defining what "big" means - lines of code, document size in MB, number of xml-objects, ... and at which number this occurs. Are 20,000 lines considered big, or 1,000,000 and so on. – hamena314 Jan 05 '16 at 10:23
  • @hamena314 You'll notice what's big when you run out of memory (which won't take that much time with DOM). As for JAXP, it's just an old term (Java API for XML Processing) to refer to the SAX/DOM/StAX parsers. You can't really "use" JAXP. – Kayaman Jan 05 '16 at 10:29
  • @Kayaman Is it something I HAVE to notice (as the environment is different each time I use a parser), or are there "rules of thumb", i.e. more than X MB, more than Y lines of code, etc.? Because noticing after doing all of the implementation seems to be too late. – hamena314 Jan 06 '16 at 08:37

2 Answers

11

(Although you haven't said so explicitly, your question seems to relate exclusively to the Java world, and this answer reflects that.)

JAXP is a set of interfaces covering XML parsing, XSLT transformation, and XML schema validation. If we just focus on the XML parsing side, its main contribution is to provide a mechanism for locating an XML parser implementation, so your source code isn't locked into a particular product. Frankly that's of limited value these days; the only two SAX/DOM parsers in common use are the one embedded in the JDK, and Apache Xerces. Apache Xerces is better in every respect except that you need to download it separately.
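To make the "locating a parser implementation" point concrete, here is a minimal sketch. The class and method names (`JaxpLookupSketch`, `rootElementName`) are made up for illustration; only the `javax.xml.parsers` and `org.w3c.dom` calls are real API. The application code never names a concrete parser class, and which implementation you actually get depends on your classpath and JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name is not part of any library.
class JaxpLookupSketch {

    /** Parses XML text with whatever DOM parser JAXP locates and returns the root element's name. */
    static String rootElementName(String xml) {
        try {
            // JAXP's contribution: newInstance() searches for an implementation at runtime,
            // so this code does not mention Xerces or any other product.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default, so switch it on explicitly
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short; real code would handle the specific exceptions.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints the concrete factory class that JAXP picked on this machine.
        System.out.println(DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println(rootElementName("<company><address/></company>"));
    }
}
```

If a different parser (for example Apache Xerces) is put on the classpath, the first line printed by `main` should change while the rest of the code stays the same.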

As for the other parsing interfaces, they break down into two categories: event-based APIs and tree-based APIs. Tree-based APIs are much easier to work with, but can use a lot of memory when handling large documents.

The two dominant event-based APIs are SAX (push) and StAX (pull). Pull parsing is something many programmers find easier because you can use the program stack to maintain state information; unfortunately, though, the StAX API is a bit buggy: different implementations have fixed its gaps in different ways. The most complete and reliable implementation of StAX is the Woodstox parser; the most complete and reliable implementation of SAX is Apache Xerces. But don't attempt to use an event-based parsing approach unless your application really needs that level of performance (and unless you have the level of experience needed to avoid losing all the performance gains at the application level).
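The push/pull difference is easier to see side by side. The sketch below counts elements with a given local name, once with SAX and once with StAX. The class and method names (`PushVsPullSketch`, `countWithSax`, `countWithStax`) are invented for this example, and it uses whatever implementations the JDK supplies rather than Xerces or Woodstox specifically.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name is not part of any library.
class PushVsPullSketch {

    /** SAX (push): the parser drives and calls our handler, so state has to live outside the callback. */
    static int countWithSax(String xml, String elementName) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (elementName.equals(localName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be filled in
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return count[0];
    }

    /** StAX (pull): our loop drives and asks for the next event, so state can be a plain local variable. */
    static int countWithStax(String xml, String elementName) {
        int count = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<staff><employee/><employee/><manager/></staff>";
        System.out.println(countWithSax(xml, "employee"));
        System.out.println(countWithStax(xml, "employee"));
    }
}
```

Neither version builds a tree, which is why both can cope with documents too large to hold in memory.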

For tree-based APIs, the DOM remains dominant solely because it was defined by W3C and is implemented in the JDK, and is therefore perceived as "standard"; also it's the one mentioned in all the books on the subject. However, of all the tree models, it is unquestionably the worst designed (mainly because it predates the introduction of namespaces). Alternatives include JDOM2, DOM4J, XOM, and AXIOM. I tend to recommend JDOM2 or XOM.

Michael Kay
  • You're right, I have changed my title in order to have "Java" in it. So JAXP is some sort of box that contains DOM/SAX (XML Parsing), XSLT, ...? And if I use DOM / SAX directly, I am indirectly "using" JAXP, as DOM and SAX originate from JAXP? I've read some reviews about XOM and it seems to be quite good, but the licence (LGPL) might make it hard for me to use in my projects. But I have to read more about that. – hamena314 Jan 06 '16 at 08:24
  • Note that the SAX/DOM implementation in the JDK is based on Apache Xerces, and it is actually better maintained than the original. – Andreas Veithen Jan 06 '16 at 09:10
  • @AndreasVeithen, Yes, it is a fork of the original. But it has some very serious bugs which have been known for donkey's years (well, at least since 2009) and have never been fixed. You don't even get any kind of acknowledgement when you report them, they just go into a black hole. – Michael Kay Jan 06 '16 at 09:38
  • @hamena314, I wouldn't describe JAXP (specifically the XML parsing part of JAXP) as "containing" DOM/SAX services, more as a kind of router that enables you to find a supplier of DOM/SAX services. The distinction is that if you know the class name of the DOM/SAX implementation you want to use, and you don't want portability across different implementations, then you can usually bypass the JAXP search mechanism. – Michael Kay Jan 06 '16 at 09:44
  • @AndreasVeithen for an example of such a bug see http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8145969. Although this was reported recently, it is a very old bug, and I reported it at least five years ago, though I cannot find my previous reports in the Oracle database (only an email from me to a customer telling them I had reported it). – Michael Kay Jan 06 '16 at 10:06
  • Update: the Oracle bug tracker claims that this bug is fixed in JDK 9. At last. – Michael Kay Nov 24 '17 at 14:09
  • Update: from JDK 9 I am no longer advising people against using the JDK version of Xerces; the major problems that existed in earlier JDK versions appear to be fixed. – Michael Kay Jun 03 '19 at 16:01
1

JAXP is just Sun's (now Oracle's) name for a collection of SAX and DOM classes they bundle with the JDK. If you're using JAXP, you're also using SAX and/or DOM. It's not a different thing.

JAXP also adds a few helper classes in the javax.xml.parsers package that fill gaps in SAX 1 and DOM 1, i.e. old versions of these libraries from 15+ years ago. However these are not necessary with SAX2/DOM3 that are used today. Worse yet, javax.xml.parsers classes such as DocumentBuilderFactory and SAXParserFactory are designed in a confusing way (they're not namespace aware by default) so they are almost always used incorrectly. Then developers come here to ask why their program doesn't do what they think it should. Just ignore these classes and use XMLReaderFactory (SAX 2) or DOMImplementationLS (DOM 3) instead.
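The namespace pitfall can be shown in a few lines. This is only a sketch: the class name, method names, and the `urn:example:company` namespace are invented for illustration. It parses the same document twice with `DocumentBuilderFactory`, once with the default settings and once with namespace awareness switched on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name and the sample namespace URI are made up.
class NamespaceAwareSketch {

    static final String SAMPLE =
            "<c:company xmlns:c=\"urn:example:company\"><c:address/></c:company>";

    /** Reports what a freshly created factory does when nothing is configured. */
    static boolean defaultIsNamespaceAware() {
        return DocumentBuilderFactory.newInstance().isNamespaceAware();
    }

    /** Parses the XML text and returns its root element. */
    static Element parseRoot(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    static String rootNamespace(String xml, boolean namespaceAware) {
        return parseRoot(xml, namespaceAware).getNamespaceURI();
    }

    static String rootLocalName(String xml, boolean namespaceAware) {
        return parseRoot(xml, namespaceAware).getLocalName();
    }

    public static void main(String[] args) {
        System.out.println("default namespace aware: " + defaultIsNamespaceAware());
        // Without namespace awareness the prefix is just part of the name and no URI is reported.
        System.out.println("off: " + rootNamespace(SAMPLE, false));
        // With it switched on, the element resolves to its namespace URI and local name.
        System.out.println("on:  " + rootNamespace(SAMPLE, true) + " / " + rootLocalName(SAMPLE, true));
    }
}
```

Code that later calls `getElementsByTagNameNS` or feeds the tree to XSLT or schema validation generally needs the second form, which is why forgetting `setNamespaceAware(true)` shows up as "my program doesn't find anything".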

  • Does namespace aware mean that in an XML document a `company` might have an XML element named `adress`, and later in the document an `employee` might also have an XML element named `adress`? Is that what you are referring to? And apart from using different factory(?) classes like `DOMImplementationLS` instead of `DocumentBuilderFactory`, are there any other differences in usage? – hamena314 Jan 06 '16 at 08:30
  • @ElliotteRustyHarold I have always taken the view that JAXP is an interface, but when you say that Oracle/Sun use the name to refer to "a collection of SAX and DOM classes" (that is, a specific implementation), I think you are right. They have a very bad track record at confusing the interface with their specific implementation. – Michael Kay Jan 06 '16 at 10:00
  • @hamena314 Besides the builder and factory classes, there are NO differences in usage between JAXP SAX and regular SAX. They are the *same* classes. They are just bundled with the JDK. Same answer for DOM. Namespace aware, in this context, has to do with how the parser passes local and qualified names to which methods. You always want this turned on and the javax.xml.parsers classes turn this off by default. :-( – Elliotte Rusty Harold Jan 06 '16 at 15:35