
I want to do some manipulation on XML content in Java. See the XML below.

Source XML:
<ns1:Order xmlns:ns1="com.test.ns" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <OrderHeader>
        <Image>Image as BinaryData of size 250KB</Image>
    </OrderHeader>
</ns1:Order>

Target XML:
<OrderData>
    <OrderHeader>
        <Image>Image as BinaryData of size 250KB</Image>
    </OrderHeader>
</OrderData>

As shown, I have the source XML and I want to produce the target XML from it. The only difference is that the root element "ns1:Order" is replaced with "OrderData" in the target XML.

FYI, OrderHeader has one sub-element, Image, which holds a binary image of 250KB (so this XML is going to be large). Also, the root element of the target XML, "OrderData", is known in advance.

Now, I want to achieve the above result in Java with the best performance. I already have the source XML content as a byte[], and I want the target XML content as a byte[] too. I am open to using a SAX parser.

Please suggest the solution with the best performance for this.

Thanks in advance, Nurali

  • For such a simple transformation on a large file, you should probably go for a SAX parser. Putting your data into byte[]'s does not magically improve your performance. – Mathias Schwarz Mar 19 '12 at 13:12
  • Furthermore, this site is not a code factory. Did you try anything so far? And did you read the FAQ? – home Mar 19 '12 at 13:14
  • Thanks for the reply. :) I already achieved it through String manipulation and also with RegEx, but I thought there should be a better way because I am concerned about the performance of this solution. I thought that rather than working on a String, I should work on byte[] or char[]. I also got my hands dirty with SAX and am still digging into it to reach a solution. What I am looking for is some guidance on which approach is better (String, RegEx, SAX, or something else) and, if possible, the pseudo logic. Thanks, Nurali – nurali.techie Mar 19 '12 at 13:26
  • RegExp is probably the worst option to go for unless you have very tight control of how the documents look now and in the future. Your XML file is _not_ large and will _not_ take up a lot of memory (little more than 250KB) if you simply load it in the DOM framework and change whatever you need to. – Mathias Schwarz Mar 19 '12 at 14:37
  • I have checked performance with the different alternatives. Here are the actual numbers: String manipulation: 2 ms; SAX: 25 ms; StAX: 60 ms; XSLT: 200 ms. Considering performance alone, string manipulation looks best, but with the risk that the transformation logic can break in the future. SAX is fine, but I would still need to put a lot of effort into coming up with the final algorithm. StAX looks better in terms of both speed and ease of implementation. XSLT is out as a choice. So StAX is what I have chosen. Thanks all for your input and comments. – nurali.techie Mar 24 '12 at 09:23
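
For readers who want to see what the StAX route mentioned in the last comment might look like, here is a minimal sketch. It is not the asker's actual code: the class name `StaxRootRenamer` and the method `renameRoot` are made up for illustration, and it assumes the source bytes are well-formed XML and that UTF-8 output is acceptable. It copies every event unchanged except the outermost start and end tags, which it replaces with an un-namespaced element, so the namespace declarations on the old root are dropped as well. An XML declaration may be written at the top of the output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the document and swaps only the root element.
public class StaxRootRenamer {

    public static byte[] renameRoot(byte[] source, String newRootName) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Assumption: the input needs no DTD processing, so turn it off.
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLEventReader reader = inputFactory.createXMLEventReader(new ByteArrayInputStream(source));

        ByteArrayOutputStream out = new ByteArrayOutputStream(source.length);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();

        int depth = 0; // 0 means we are outside (or exactly at) the root element
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                if (depth == 0) {
                    // Replace the root start tag; its attributes and xmlns declarations are dropped.
                    event = events.createStartElement("", "", newRootName);
                }
                depth++;
            } else if (event.isEndElement()) {
                depth--;
                if (depth == 0) {
                    event = events.createEndElement("", "", newRootName);
                }
            }
            writer.add(event); // everything else is copied through untouched
        }
        writer.flush();
        writer.close();
        return out.toByteArray();
    }
}
```

Usage would be along the lines of `byte[] target = StaxRootRenamer.renameRoot(sourceBytes, "OrderData");`. Note that descendants which use a prefix declared only on the old root would lose that declaration, so this fits documents shaped like the sample above.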

4 Answers


Do you mean machine performance or human performance? Spending an infinite amount of programmer time to achieve a microscopic gain in machine performance is a strange trade-off to make these days, when a powerful computer costs about the same as half a day of a contract programmer's time.

I would recommend using XSLT. It might not be fastest, but it will be fast enough. For a simple transformation like this, XSLT performance will be dominated by parsing and serialization costs, and those won't be any worse than for any other solution.
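
As a rough illustration of the XSLT route, the sketch below embeds a small XSLT 1.0 stylesheet in Java and runs it through the standard JAXP API. The class name `XsltRootRenamer` and the stylesheet itself are assumptions made for this example, not something given in the question. The stylesheet replaces the root with `OrderData` and rebuilds descendant elements with `xsl:element`, so the unused namespace declarations from the old root are not carried over. The stylesheet is compiled once into a `Templates` object, since compiling it on every call would add avoidable cost.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: renames the root element with a precompiled stylesheet.
public class XsltRootRenamer {

    // Assumed stylesheet: the target root name "OrderData" is hard-coded because it is known in advance.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/*'>"
        + "<OrderData><xsl:apply-templates select='node()'/></OrderData>"
        + "</xsl:template>"
        + "<xsl:template match='*'>"
        + "<xsl:element name='{name()}' namespace='{namespace-uri()}'>"
        + "<xsl:copy-of select='@*'/>"
        + "<xsl:apply-templates select='node()'/>"
        + "</xsl:element>"
        + "</xsl:template>"
        + "<xsl:template match='text()|comment()|processing-instruction()'>"
        + "<xsl:copy/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Templates objects are thread-safe and reusable; Transformer instances are not.
    private static final Templates TEMPLATES = compile();

    private static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Stylesheet failed to compile", e);
        }
    }

    public static byte[] renameRoot(byte[] source) throws TransformerException {
        Transformer transformer = TEMPLATES.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream(source.length);
        transformer.transform(new StreamSource(new ByteArrayInputStream(source)),
                              new StreamResult(out));
        return out.toByteArray();
    }
}
```

If timings matter, it is worth measuring with the stylesheet compiled outside the timed section, because compilation is a one-off cost.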

Michael Kay

Not much will beat direct byte/String manipulation, for instance with a regular expression.

But be warned: manipulating XML with regex is always hotly debated.

Bruno Grieder
  • A correct regex solution will probably be very slow. Only use this approach if you don't care about correctness. – Michael Kay Mar 19 '12 at 15:29
  • @Michael 99% (or more) of our XML processing is done using XSLTs run by Saxon for a lot of obvious reasons, so I am not going to argue against this... but there is always this case where you have a big number of large files on which you must quickly make a small, simple, well-defined change (the OP's case, as I understand it). Then suddenly, CPU and memory consumption may become an issue; the argument is then maintenance/process risk vs speed, not technology vs technology. – Bruno Grieder Mar 19 '12 at 16:57
  • Well, yes, there isn't a rule in the book that I won't bend in extreme situations. But there's no evidence in the post that this requirement is extreme enough to justify such desperate measures. – Michael Kay Mar 23 '12 at 00:22

I have used XSLT to transform XML documents. That's another way to do it. There are several Java implementations of XSLT processors.

sarahTheButterFly

The fastest way to manipulate strings in Java is direct manipulation, using a StringBuilder for the results. I wrote code to modify 20 MB strings that built a table of change locations and then copied and modified the string into a new StringBuilder. For Strings, XSLT and RegEx are much slower than direct manipulation, and SAX/DOM parsers are slower still.
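
Applied to the question's document, direct manipulation could be sketched roughly as follows. This is an illustration only, not the code described above: the class `StringRootRenamer` is hypothetical, and it relies on strong assumptions that real XML does not guarantee. It assumes the text starts with the root start tag (no XML declaration or comment before it), that no attribute value in that tag contains `>`, that the root is not self-closing, and that the last `</` in the text begins the root end tag.

```java
// Hypothetical helper: replaces the root start and end tags by index arithmetic, without parsing.
public class StringRootRenamer {

    public static String renameRoot(String xml, String newRootName) {
        // Assumption: the first '>' closes the root start tag.
        int openEnd = xml.indexOf('>');
        // Assumption: the last "</" opens the root end tag.
        int closeStart = xml.lastIndexOf("</");
        if (openEnd < 0 || closeStart <= openEnd) {
            throw new IllegalArgumentException("Input does not look like <root>...</root>");
        }

        StringBuilder result = new StringBuilder(xml.length());
        result.append('<').append(newRootName).append('>');
        // Copy everything between the two root tags in a single pass.
        result.append(xml, openEnd + 1, closeStart);
        result.append("</").append(newRootName).append('>');
        return result.toString();
    }
}
```

With a byte[] input, one would decode it first, for example `new String(bytes, StandardCharsets.UTF_8)`, which itself assumes the encoding is known. If any of the assumptions above fails, the output is silently wrong, which is the risk the comments below argue about.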

Michael Shopsin
  • There is no reason why a SAX parser should be any slower than RegExp. SAX requires nothing but a simple linear scan through the file. – Mathias Schwarz Mar 19 '12 at 14:33
  • SAX parsers are faster than DOM parsers, but they still seem to incur some additional overhead compared to RegEx or direct string manipulation. The good news is that SAX parsers have a fixed penalty, while DOM parsers become much slower for larger XML files. – Michael Shopsin Mar 19 '12 at 15:05
  • "Parsing" correctly with RegEx or string hacking is impossible, so it makes little sense to compare them this way. – Mathias Schwarz Mar 19 '12 at 15:10
  • Often changes to text can be made without fully parsing the file, which saves time. For example, if Bob needs to be changed to Alice, you can alter the text without worrying that it's XML formatted as Bob. – Michael Shopsin Mar 19 '12 at 15:14
  • I don't want to get into a long and pointless discussion about this, but in general XML can contain character escapes and all sorts of things that make it impossible to make a string replace work correctly. On top of that, the 'Bob' could be a substring elsewhere in the file. If you only make a small change to the XML file and then write it back, IO is going to be the bottleneck regardless... – Mathias Schwarz Mar 19 '12 at 15:20
  • Agreed, in general string replacement can occur out of context. In my case speed was essential and the strings to replace were URL paths, which were not likely to occur by accident. – Michael Shopsin Mar 19 '12 at 15:26
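
For comparison with the string approaches debated in these comments, a SAX version of the root rename can be sketched as a filter in front of a serializer. The class name `SaxRootRenamer` is made up for this example, and using an identity `TransformerHandler` as the serializer is one possible choice among several; the cast to `SAXTransformerFactory` assumes the JAXP implementation supports it, as the JDK's built-in one does. The filter performs the single linear scan mentioned above and rewrites only the outermost element.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical SAX filter: forwards all events, swapping only the root element.
public class SaxRootRenamer extends XMLFilterImpl {
    private final String newRootName;
    private int depth = 0; // number of currently open elements

    SaxRootRenamer(XMLReader parent, String newRootName) {
        super(parent);
        this.newRootName = newRootName;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        // Mappings reported while depth == 0 belong to the old root; drop them.
        if (depth > 0) super.startPrefixMapping(prefix, uri);
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        if (depth > 0) super.endPrefixMapping(prefix);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        if (depth == 0) {
            // New root: no namespace and no attributes.
            super.startElement("", newRootName, newRootName, new AttributesImpl());
        } else {
            super.startElement(uri, local, qName, atts);
        }
        depth++;
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        depth--;
        if (depth == 0) {
            super.endElement("", newRootName, newRootName);
        } else {
            super.endElement(uri, local, qName);
        }
    }

    public static byte[] renameRoot(byte[] source, String newRootName) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader parser = parserFactory.newSAXParser().getXMLReader();

        // An identity TransformerHandler turns SAX events back into bytes.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream(source.length);
        serializer.setResult(new StreamResult(out));

        SaxRootRenamer filter = new SaxRootRenamer(parser, newRootName);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new ByteArrayInputStream(source)));
        return out.toByteArray();
    }
}
```

Unlike the string approach, this handles escapes, CDATA sections and a leading XML declaration correctly, at the price of a full parse. Comments and other lexical events are not forwarded in this sketch, and descendants that depend on a prefix declared on the old root would lose that declaration.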