I strongly suggest that you don't use regular expressions to parse XML, and in this case, you shouldn't use regex at all.
What you need is a good XML parser/streamer framework, such as SAX or StAX (given the size of the file, I would go with the latter).
You would basically push each and every streaming event you read to a writer.
Once you identify a characters event while parsing the file with your reader instance, instead of writing it directly, you replace each symbol with its entity and write the replaced String instead of the original one.
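To make the idea concrete, here is a minimal sketch of that read-and-rewrite loop with the StAX event API. The class name StaxCopy and the replaceSymbols hook are hypothetical names of mine, and the sketch works on a String for brevity (for a real file you would pass file readers and writers instead). One caveat: as far as I know, the JDK's writer already escapes <, > and & in character data, so the hook is only for any further symbols you want to rewrite.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.UnaryOperator;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxCopy {
    // Copies every event from reader to writer; character data goes
    // through the caller-supplied hook first (hypothetical name).
    static String copy(String xml, UnaryOperator<String> replaceSymbols) {
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            // Ask the parser to hand over each text run as a single event.
            inputFactory.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLEventReader reader =
                    inputFactory.createXMLEventReader(new StringReader(xml));

            StringWriter out = new StringWriter();
            XMLEventWriter writer =
                    XMLOutputFactory.newInstance().createXMLEventWriter(out);
            XMLEventFactory events = XMLEventFactory.newInstance();

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    // Rewrite the text instead of passing it through as-is.
                    String text = event.asCharacters().getData();
                    writer.add(events.createCharacters(replaceSymbols.apply(text)));
                } else {
                    // Everything else (tags, comments, ...) is copied unchanged.
                    writer.add(event);
                }
            }
            writer.flush();
            writer.close();
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            // Thrown when the input is not well-formed XML.
            throw new IllegalStateException("Could not stream the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><a>x &amp; y &lt; z</a></root>";
        // The hook here is a placeholder; plug in your own replacement.
        System.out.println(copy(xml, text -> text));
    }
}
```

Since only one event is held in memory at a time, the same loop should cope with large files when it is fed streams rather than Strings.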
Note: here is an official StAX tutorial to get you started. Here is the Java EE 5 reference page, which contains additional information.
Why do that instead of applying a Pattern and parsing the whole file with a BufferedReader?
- Because the performance would be awful (re-matching the Pattern against each line of your 5MB file)
- Because your Pattern would have to be very complex (so, unreadable, and again, bad performance)
More SO documentation on regex XML parsing VS proper XML parsing here.
Edit
I hadn't considered the case of a huge, entirely malformed XML file.
In this case, a streamer framework might be impossible to use, since the file being streamed is not valid XML in the first place.
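A quick way to find out which situation you are in is to let a StAX reader walk the input and see whether it throws. The sketch below is only an illustration; WellFormedCheck and isWellFormed are hypothetical names of mine, and for a real file you would pass a file reader instead of a String.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WellFormedCheck {
    // Returns true if a StAX reader can walk the whole input without error.
    static boolean isWellFormed(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // throws on the first well-formedness error
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped '<' and '&' in the text make this input malformed.
        System.out.println(isWellFormed("<element>blah < > &</element>"));
        System.out.println(isWellFormed("<element>blah &lt; &gt; &amp;</element>"));
    }
}
```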
If you have exhausted every other choice, you want to pinch your nose shut, use a BufferedReader, and do something like this (needs a lot of elaboration - don't take it literally):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String killMe = "<element>blah < > &</element>";
// only valuable piece of info here: checks for characters within a node
// across multiple lines (DOTALL lets '.' match line breaks) - again, needs a lot of work
Pattern please = Pattern.compile(">(.+)</", Pattern.DOTALL);
Matcher iWantToDie = please.matcher(killMe);
while (iWantToDie.find()) {
    System.out.println("Uugh: " + iWantToDie.group(1));
    System.out.println("LT: " + iWantToDie.group(1).replace("<", "&lt;"));
    System.out.println("GT: " + iWantToDie.group(1).replace(">", "&gt;"));
    System.out.println("AND: " + iWantToDie.group(1).replace("&", "&amp;"));
}
Output:
Uugh: blah < > &
LT: blah &lt; > &
GT: blah < &gt; &
AND: blah < > &amp;
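If you do go down this road, the three replacements have to be chained rather than applied separately, and the ampersand has to go first, otherwise the & of a freshly written entity gets escaped again. Here is a rough sketch of that, under the same "don't take it literally" caveat: RegexEscape, escape and escapeText are hypothetical names of mine, the pattern is the naive one from above made reluctant, and it will double-escape text that already contains entities and trip over nested elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEscape {
    // Naive "text between > and </" pattern; DOTALL lets '.' span line breaks.
    private static final Pattern TEXT = Pattern.compile(">(.+?)</", Pattern.DOTALL);

    // '&' must be replaced first so the entities written below stay intact.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Rewrites every matched text run in place and keeps the rest untouched.
    static String escapeText(String xml) {
        Matcher matcher = TEXT.matcher(xml);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement = ">" + escape(matcher.group(1)) + "</";
            // quoteReplacement stops '$' and '\' from being read as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("<element>blah < > &</element>"));
    }
}
```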
Something Else