As a complement to the answer by @kjughes, writing a program to remove ampersand characters outside CDATA sections is fairly straightforward, although a rather boring exercise. Since CDATA sections cannot be nested, it is easy to track where a section opens and closes.
Here is one such program:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StripAmpersands {
    public static void main(String[] args) throws IOException {
        // Scanner states. The comment next to each one shows the input consumed so far.
        final int NOCDATA = -1;     // ordinary text
        final int OPEN_CDATA0 = 0;  // !
        final int OPEN_CDATA1 = 1;  // ![
        final int OPEN_CDATA2 = 2;  // ![C
        final int OPEN_CDATA3 = 3;  // ![CD
        final int OPEN_CDATA4 = 4;  // ![CDA
        final int OPEN_CDATA5 = 5;  // ![CDAT
        final int OPEN_CDATA6 = 6;  // ![CDATA
        final int INSIDE_CDATA = 7; // ![CDATA[
        final int CLOSE_CDATA0 = 8; // ]

        String xml = "<title>Very bad XML with & (unescaped)</title>\n" +
                "<title>Good XML with &amp; and &#x3E; (escaped)</title>\n" +
                "<title><![CDATA[ Good XML with & in CDATA && ]]></title><title>Very bad XML with ![CDATA[&]] && (unescaped)</title>";

        StringBuilder result = new StringBuilder();
        Reader reader = new BufferedReader(new StringReader(xml));
        int r;
        int state = NOCDATA;
        while ((r = reader.read()) != -1) {
            char c = (char) r;
            // Advance the state on the characters of "![CDATA[" and "]]".
            // Any of these characters seen out of order outside CDATA resets to NOCDATA.
            switch (c) {
                case '!':
                    if (state == NOCDATA)
                        state = OPEN_CDATA0;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case '[':
                    if (state == OPEN_CDATA0)
                        state = OPEN_CDATA1;
                    else if (state == OPEN_CDATA6)
                        state = INSIDE_CDATA;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case 'C':
                    if (state == OPEN_CDATA1)
                        state = OPEN_CDATA2;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case 'D':
                    if (state == OPEN_CDATA2)
                        state = OPEN_CDATA3;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case 'A':
                    // 'A' occurs twice in "CDATA", so it advances from two different states.
                    if (state == OPEN_CDATA3)
                        state = OPEN_CDATA4;
                    else if (state == OPEN_CDATA5)
                        state = OPEN_CDATA6;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case 'T':
                    if (state == OPEN_CDATA4)
                        state = OPEN_CDATA5;
                    else if (state != INSIDE_CDATA)
                        state = NOCDATA;
                    break;
                case ']':
                    // First ']' starts the close sequence; the second one ends the section.
                    if (state == INSIDE_CDATA)
                        state = CLOSE_CDATA0;
                    else if (state == CLOSE_CDATA0)
                        state = NOCDATA;
                    break;
                default:
                    break;
            }
            // The state is still CLOSE_CDATA0 but the current character is not ']':
            // the close sequence is broken, so report it and stop.
            if (state == CLOSE_CDATA0 && c != ']') {
                System.err.println("ERROR CLOSING");
                System.out.println(result);
                System.exit(1);
            }
            // Drop '&' everywhere except inside a CDATA section.
            if (c != '&' || state == INSIDE_CDATA)
                result.append(c);
        }
        System.out.println(result);
    }
}
This program outputs the following for the input in the question (a copy of the first string has been appended to the end of the input, with an additional CDATA marker, to check the handling of closing brackets):
<title>Very bad XML with (unescaped)</title>
<title>Good XML with amp; and #x3E; (escaped)</title>
<title><![CDATA[ Good XML with & in CDATA && ]]></title><title>Very bad XML with ![CDATA[&]] (unescaped)</title>
It is essentially a simple state machine built with a switch/case statement. I have not tested it extensively, and I suspect nested CDATA sections would make it fail (which does not seem to be allowed in the question anyway). I also did not bother checking for the final > of the CDATA closing delimiter. It should be easy to modify the program to cover any failing cases; the point of this answer is to show a workable structure for the lexical analysis of CDATA sections.
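For comparison, here is a minimal sketch of one way to cover the two gaps mentioned above (the leading < and the final >). It is not the state machine from this answer but a swapped-in technique: it searches for the full delimiters <![CDATA[ and ]]> with String.indexOf, copies each CDATA section verbatim, and drops & everywhere else. The class and method names are hypothetical, and throwing on an unterminated section is an assumption about the desired behaviour. Like the program above, it also removes the & of entities that are already escaped.

```java
final class CdataAwareAmpersandStripper {
    // Full CDATA delimiters as defined by XML.
    private static final String CDATA_OPEN = "<![CDATA[";
    private static final String CDATA_CLOSE = "]]>";

    private CdataAwareAmpersandStripper() {}

    /** Removes every '&' outside CDATA sections; CDATA sections are copied unchanged. */
    static String strip(String xml) {
        StringBuilder result = new StringBuilder(xml.length());
        int pos = 0;
        while (pos < xml.length()) {
            int open = xml.indexOf(CDATA_OPEN, pos);
            if (open < 0) {
                // No further CDATA section: the rest is ordinary text.
                appendWithoutAmpersands(result, xml, pos, xml.length());
                break;
            }
            // Ordinary text before the section loses its ampersands.
            appendWithoutAmpersands(result, xml, pos, open);
            int close = xml.indexOf(CDATA_CLOSE, open + CDATA_OPEN.length());
            if (close < 0) {
                // Assumption: an unterminated section is treated as an error.
                throw new IllegalArgumentException("Unterminated CDATA section at index " + open);
            }
            int end = close + CDATA_CLOSE.length();
            // Copy the whole section, delimiters included, without touching it.
            result.append(xml, open, end);
            pos = end;
        }
        return result.toString();
    }

    private static void appendWithoutAmpersands(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c != '&') {
                out.append(c);
            }
        }
    }
}
```

Calling CdataAwareAmpersandStripper.strip(xml) on the test string above should differ from the state machine in one place: the trailing ![CDATA[&]] has no leading <, so it is treated as ordinary text and its & is removed as well. A single ] inside a CDATA section is also accepted, since only the complete ]]> sequence ends a section.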