I have a problem similar to the ones described in these topics: 1) Replace >, <, & chars that appear inside XML nodes 2) Regular expression to match ">", "<", "&" chars that appear inside XML nodes

and I'm looking for a solution that works in Java. In practice, I have a huge XML file (~5 MB) and I want to replace the special characters with their respective entities (escaped characters) without changing the XML tags. A typical example would be:

<tag><anothertag>& < > </anothertag></tag> (before)
<tag><anothertag>&amp; &lt; &gt; </anothertag></tag> (after).

Thanks in advance

  • You want to replace < with < ? Why?? Also your before and after statements are identical. – Philip Whitehouse Sep 30 '13 at 20:54
  • Is there any particular reason that you don't just use a CDATA block? By the way, I wouldn't use regex for this. @Philip: as far as I interpret the question, the OP actually wants to do it the other way round (i.e. make syntactically invalid XML syntactically valid). This is at least mentioned in the title, links and code example. – BalusC Sep 30 '13 at 20:54
  • Your explanation suggests you want to go `<` to `<`, while your example shows the opposite transition. Can you clarify which you actually want? – thegrinner Sep 30 '13 at 20:55
  • Based on a response to 2, I'm not sure this is tractable: `Something
    Something Else
    ` would convert to `Something<br/>Something Else` even if you accounted for nesting. I think you should fix whatever generates this code. (Thanks @BalusC )
    – Philip Whitehouse Sep 30 '13 at 20:58
  • XML is not regular language. It means: please, do not try to feed it into a regex. You will gain only pain. – Display Name Sep 30 '13 at 21:01
  • @PhilipWhitehouse: the XML is generated from a remote node, not by me. The file isn't indented and is very hard to read (due to size). I was trying to indent it using Transform class, but I got several errors about the presence of some characters like "&", "<", ">". – Tinez Ridan Sep 30 '13 at 21:09
  • @BalusC: sorry for the misunderstanding. – Tinez Ridan Sep 30 '13 at 21:09
  • If you can, ask whoever is in charge of the remote node to fix the XML output. This is really the preferred solution. If there's no possibility of regenerating the huge XML document with correct escaping, then you will probably have to write a custom parser to fix the document. You should not use the custom parser on any input except the specific broken data from this node, because it will do Bad Things to real XML from well-behaved services. – Mike Clark Sep 30 '13 at 21:49

2 Answers

I strongly suggest that you don't use regular expressions to parse XML, and in this case, you shouldn't use regex at all.

What you need is a good XML parser/streamer framework, such as SAX or StAX (due to the size of the file, I would go with the latter).

You would basically push each and every streaming event you read to a writer.

Once you identify a characters event while parsing the file with your reader instance, instead of directly writing it, you replace each symbol with its entity, and write the replaced String instead of the original one.
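As a minimal sketch of that read-and-rewrite loop (an illustration, not the answer's own code), the class below copies every StAX event from an `XMLEventReader` to an `XMLEventWriter`. The class name `StaxCopySketch` and the method `copy` are hypothetical. It assumes the input is already well-formed XML; raw `&` or `<` in text content makes the reader fail before anything can be rewritten. Note that `XMLStreamWriter.writeCharacters` is documented to escape `&`, `<` and `>` itself, so in this sketch the manual replacement step is left to the writer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical helper class; the names are illustrative, not from the answer.
class StaxCopySketch {

    // Reads well-formed XML and re-serializes it one event at a time.
    // Text content is escaped again by the writer when it is written out.
    static String copy(String wellFormedXml) {
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(wellFormedXml));
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(out);
            while (reader.hasNext()) {
                // Push each event (start tag, characters, end tag, ...) to the writer.
                writer.add(reader.nextEvent());
            }
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            // Assumption: callers prefer an unchecked exception for malformed input.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(copy("<tag><anothertag>&amp; &lt; &gt; </anothertag></tag>"));
    }
}
```

For a real file you would pass a `FileReader` or `InputStream` to the factories instead of a `String`, so the 5 MB document is never held in memory as a whole.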

Note: here is an official StAX tutorial to get you started. Here is the JEE5 reference page, which contains additional information.

Why do that instead of applying a Pattern and parsing the whole file with a BufferedReader?

  • Because the performance would be awful (re-matching on the Pattern for each line of your 5MB file)
  • Because your Pattern would have to be very complex (so, unreadable, and again, bad performance)

More SO documentation on regex XML parsing vs. proper XML parsing here.

Edit

I hadn't considered the case of a huge, entirely malformed XML file. In that case, a streaming framework might be impossible to use, since the file being streamed is not well-formed XML in the first place.

If you have exhausted every other choice, you want to pinch your nose shut, use a BufferedReader, and do something like this (needs a lot of elaboration - don't take it literally):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String killMe = "<element>blah < > &</element>";
// only valuable piece of info here: checks for characters within a node
// across multiple lines - again, needs a lot of work
// (DOTALL lets '.' match line breaks; MULTILINE only changes ^ and $)
Pattern please = Pattern.compile(">(.+)</", Pattern.DOTALL);
Matcher iWantToDie = please.matcher(killMe);
while (iWantToDie.find()) {
    // group(1) is the text between the opening and closing tag
    System.out.println("Uugh: " + iWantToDie.group(1));
    System.out.println("LT: " + iWantToDie.group(1).replace("<", "&lt;"));
    System.out.println("GT: " + iWantToDie.group(1).replace(">", "&gt;"));
    System.out.println("AND: " + iWantToDie.group(1).replace("&", "&amp;"));
}

Output:

Uugh: blah < > &
LT: blah &lt; > &
GT: blah < &gt; &
AND: blah < > &amp;
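
The snippet above prints each replacement separately. To apply all three to the same string, the order matters: a minimal sketch, assuming the text contains no entities that are already escaped, is shown below. The class name `XmlTextEscaper` and the method `escapeText` are hypothetical, introduced only for illustration.

```java
// Hypothetical helper; the class and method names are illustrative.
class XmlTextEscaper {

    // Escapes the three characters from the question in text content.
    // The ampersand goes first so that the '&' introduced by "&lt;" and
    // "&gt;" is not escaped a second time.
    static String escapeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: blah &lt; &gt; &amp;
        System.out.println(escapeText("blah < > &"));
    }
}
```

This helper does not recognize existing entities: an input that already contains `&lt;` comes out as `&amp;lt;`, so it should only be run on text known to be unescaped.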
Mena
This is a tough one because, as far as I know, having tokens like > and < as part of the content of your XML means the XML is invalid. My best advice is to find a good XML parser like http://dom4j.sourceforge.net/dom4j-1.6.1/ and hope it can handle your issues.

Michael Hoyle