12

I have an XML document in a specific format that will be pushed to me. The document will always be of the same type, so the format is very strict.

I need to parse it so that I can convert it into JSON (well, a slightly bastardized version, so that someone else can use it with Dojo).

My question is: should I use a very fast, lightweight XML parser (no need for SAX, etc.; any ideas?), or write my own, basically converting the document into a StringBuffer and spinning through the array? Under the covers, I assume all XML parsers spin through the string (or memory buffer) and parse it, producing output on the way through.

Thanks

edit

The XML will be between 3 or 4 lines and about 50 lines at the extreme.

tharindu_DG
joe90

8 Answers

11

No, you should not try to write your own XML parser for this.

SAX itself is very lightweight and fast, so I'm not sure why you think it's too much. Also, using a string buffer would actually be much less scalable than using SAX, because SAX doesn't require you to load the whole XML file into memory. I've used SAX to parse multi-gigabyte XML files, which you wouldn't be able to do using string buffers on a 32-bit machine.
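As a rough illustration of how small a SAX handler for this kind of translation can be, here is a minimal sketch using the JDK's own SAX parser. The class name SaxToJson, the `<items>`/`<item>` document shape, and the JSON layout are all assumptions made up for the example, not the asker's real format.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToJson {

    /** Hypothetical handler: turns each <item> into a JSON object of its child elements. */
    static class ItemHandler extends DefaultHandler {
        final StringBuilder json = new StringBuilder("[");
        private final StringBuilder text = new StringBuilder();
        private boolean inItem = false;
        private boolean firstItem = true;
        private boolean firstField = true;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
            if (qName.equals("item")) {
                json.append(firstItem ? "{" : ",{");
                firstItem = false;
                firstField = true;
                inItem = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per text node
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("item")) {
                json.append("}");
                inItem = false;
            } else if (inItem) {
                // A child of <item>: emit it as "tagName":"text"
                if (!firstField) json.append(",");
                firstField = false;
                json.append('"').append(escape(qName)).append("\":\"")
                    .append(escape(text.toString().trim())).append('"');
            }
        }

        @Override
        public void endDocument() {
            json.append("]");
        }
    }

    /** Minimal JSON string escaping (backslash and double quote only). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String toJson(String xml) {
        try {
            ItemHandler handler = new ItemHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.json.toString();
        } catch (Exception e) { // ParserConfigurationException, SAXException, IOException
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item><name>a</name><value>1</value></item></items>";
        System.out.println(toJson(xml)); // prints [{"name":"a","value":"1"}]
    }
}
```

The handler writes output as events arrive, so nothing beyond the current element's text is held in memory. Nested elements inside a field and control characters in values are not handled in this sketch.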

If you have small files and you don't need to worry about performance, look into using the DOM. Java's implementation can be kind of annoying to use: you create a Document by using a DocumentBuilder, which comes from a DocumentBuilderFactory.

The code to create a document from a file looks like this:

Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new FileInputStream("file.xml"));

(Note that keeping a reference to your DocumentBuilder will speed things up if you need to parse multiple files.)

Then you use the methods of org.w3c.dom.Document to read or manipulate the contents. For example, getElementsByTagName() returns all the Elements with a given tag name.
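A minimal sketch of the DOM approach, parsing from an in-memory string rather than a file. The class name DomExample, the helper textOf, and the sample tag names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomExample {

    /** Returns the text content of every element with the given tag name. */
    static List<String> textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getElementsByTagName searches the whole document, in document order
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) { // ParserConfigurationException, SAXException, IOException
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item><name>a</name></item><item><name>b</name></item></items>";
        System.out.println(textOf(xml, "name")); // prints [a, b]
    }
}
```

For documents of 3 to 50 lines, building the whole tree in memory like this is unlikely to matter.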

Chad Okere
  • 1
    I suspect that by "lightweight", Joe means "is easy to use"; SAX's callback-oriented API is not the most user-friendly. – Michael Borgwardt Jan 25 '10 at 18:25
  • 1
    I would have +'ed this up more if I could. SAX is about the most efficient way possible to read XML in Java. You'd be hard pressed to write a better correct XML parser. It should be possible to write the callback to produce the JSON directly, I would think. If there is little translation then it may be extremely tiny. – PSpeed Jan 25 '10 at 18:43
  • @Michael Borgwardt: I think using the DOM would be easier than writing your own parser :) – Chad Okere Jan 25 '10 at 18:49
  • But DOM is _definitely_ not light-weight. For this sort of translation from one format to another, SAX is ideal. Do it right and you could handle files that would never fit in memory. (You wouldn't need it in this case, but that's not the point.:)) – PSpeed Jan 25 '10 at 18:57
  • @PSpeed: IMHO SAX is not ideal, because the event-driven approach of SAX is harder to understand and use than the pull-parsing approach (of the kXML parser or similar). – WildWezyr Jan 25 '10 at 19:22
  • Yes, JSON does have a toXML and you can make JSON.XMLtoJSON, but I need to add extra bits and change a few bits around to satisfy the Dojo requirements. As the quick bursts will be very strict in format and typically 3 or 4 lines long (50 at the most, as a recurring set of 3- or 4-line elements), holding them in memory will not be too much of an issue. Thanks again for the comments so far. – joe90 Jan 25 '10 at 19:39
  • I think pull versus push comes down to personal experience, at some point. For data transformation, going from one format to another, push seems to result in less code generally. And it's usually more reusable. Mileage may vary with different use-cases. Plus, I have my own SAX utilities that add tag name based dispatch and an object stack which makes this stuff even more trivial sometimes. (http://meta-jb.svn.sourceforge.net/viewvc/meta-jb/trunk/dev/src/main/java/org/progeeks/util/xml/XmlReader.java?revision=3500&view=markup) I'd do that a little differently today but it works. – PSpeed Jan 25 '10 at 19:47
  • push + dispatch is nice (for example) when you are ignoring large portions of the input. – PSpeed Jan 25 '10 at 19:48
7

It really depends on the type of XML you're parsing. I wouldn't write your own parser when there's something already there to do the job for you.

The choice between SAX and DOM really depends on what you're trying to parse; see this for how to decide which one to use:

http://geekexplains.blogspot.com/2009/04/sax-vs-dom-differences-between-dom-and.html

Even if you don't use SAX or DOM, there are still simple options available to you. Take a look at Simple :)

http://simple.sourceforge.net/

You may also want to consider StAX.
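For comparison, a minimal StAX sketch using javax.xml.stream from the JDK. With a pull parser your code asks for the next event instead of receiving callbacks. The class name StaxExample and the `<name>` tag are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {

    /** Collects the text of every <name> element (a hypothetical tag). */
    static List<String> names(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // pull the next parsing event
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("name")) {
                    // Reads the text-only content and moves to the matching end tag
                    result.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (Exception e) { // XMLStreamException
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<items><item><name>a</name></item><item><name>b</name></item></items>";
        System.out.println(names(xml)); // prints [a, b]
    }
}
```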

Chris K
Jonathan Holloway
3

Maybe you should look at kXML 2, a small XML pull parser designed for constrained environments, used to access, parse, and display XML files on Java 2 Micro Edition devices. It works well with Java SE/EE too ;-). As it is designed for the Micro Edition, it is really lightweight (small footprint) and IMHO really easy to use (much easier than SAX, DOM, etc.).

From my own experience with kXML 2: I used it to parse XML files larger than 1 GB (Wikipedia dumps), and I was very happy with its performance and memory consumption.

And finally ;-), the link: http://kxml.sourceforge.net/kxml2/

WildWezyr
  • Thanks, will have a look at that :) as we will need a mobile version at some point too. – joe90 Jan 25 '10 at 19:41
1

You can use dom4j or XStream to read the XML into an equivalent Java model and then use Json-lib to convert it into JSON.

Teja Kantamneni
1

Do you really need to parse or manipulate any of the data in the XML document? If not, you could just use an XSLT stylesheet. Really simple, really fast.
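A minimal sketch of what that could look like with the JDK's javax.xml.transform API and an XSLT 1.0 stylesheet that emits text. The stylesheet, the `<items>`/`<item>`/`<name>` shape, and the class name XsltToJson are assumptions for illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToJson {

    // Hypothetical stylesheet for a hypothetical <items>/<item>/<name> document.
    // It does not escape quotes or backslashes in values, so it only suits
    // simple, trusted data.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/items'>"
      + "<xsl:text>[</xsl:text>"
      + "<xsl:for-each select='item'>"
      + "<xsl:if test='position() != 1'>,</xsl:if>"
      + "<xsl:text>{\"name\":\"</xsl:text>"
      + "<xsl:value-of select='name'/>"
      + "<xsl:text>\"}</xsl:text>"
      + "</xsl:for-each>"
      + "<xsl:text>]</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) { // TransformerException and its subclasses
            throw new IllegalArgumentException("Could not transform XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item><name>a</name></item><item><name>b</name></item></items>";
        System.out.println(transform(xml)); // prints [{"name":"a"},{"name":"b"}]
    }
}
```

If the output needs real JSON escaping or the "extra bits" mentioned in the question, a stylesheet gets awkward quickly and a parser-based approach may be the easier route.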

Bal
0

Use a real XML parser. If you don't, you will probably get bitten when something changes. The document may be "very strict", but in two years' time something will probably get refactored, and its structure will change in a way that still parses to the same data structure with an XML parser but breaks a homebrew string parser.

Quentin
  • I see your point, but already in other areas (i.e. the next step in the chain) they have changed bits from pure JSON to satisfy their requirements. – joe90 Jan 25 '10 at 19:41
  • So the not-really-JSON parser is set up to take a fall, but there is no need to compound the issue by introducing the same problem by using a not-really-XML parser. – Quentin Jan 25 '10 at 21:37
0

Parsing on the backend and exposing JSON is probably the right way to go, so that you have general-purpose JSON data that you can easily integrate with other sources. But if you have a simple message and this is the only place you think you'd be using JSON, you could try to do the parsing client side. Dojo has an experimental client-side XML parser.

peller
-2

Do you have to use XML?

I found that my own custom text format was much faster than either XML or JSON with any of the off-the-shelf packages. They were fast, but by controlling my own format and just doing String parsing I was able to cut the time in half compared with the fastest XML implementation.

Obviously this only works if you're fully in charge of formats and may not be appropriate to your situation, but for any others in this situation: don't think XML is the absolute fastest option you have. It's not.

Brian