How to write regex for XML which removes unescaped ampersand characters except CDATA?

Question

For example, I have XML like this:

<title>Very bad XML with & (unescaped)</title>
<title>Good XML with &amp; and &#x3E; (escaped)</title>
<title><![CDATA[ Good XML with & in CDATA ]]></title>

My task is to remove invalid ampersand characters from XML, except for those that are inside CDATA. I found a regex that does it:

&(?!(?:apos|quot|[gl]t|amp);|#)

but unfortunately, it also removes ampersand characters from CDATA. How can I change this regex so that it satisfies my task?

Using regular expressions to manipulate structured formats is inherently broken. Use an XML-aware tool and perhaps inside of that a simple regex. — tripleee, Dec 11 '19 at 16:05
This must be done dynamically in the program code before I pass it to the XML parser, because special symbols make the XML invalid for the parser. — proninyaroslav, Dec 11 '19 at 16:11
Which program? Written in which language? There is probably a way to do what you ask, but again, a monster regex is not a good approach. See also https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-no (which discusses HTML primarily; but the pertinent issues are the same). — tripleee, Dec 11 '19 at 16:14
An RSS viewer written in Java. I use the SAX parser from the standard library. — proninyaroslav, Dec 11 '19 at 16:17
Gross, that RSS feed isn't valid XML. People are just puking data in-between tags. If they want to do that, then they should be wrapping the text of those `title` elements in CDATA — Mads Hansen, Dec 11 '19 at 16:37
Basically `[tag:regex] + [tag:xml] = divide by zero error`: [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ctwheels, Dec 11 '19 at 16:41
@MadsHansen is right, it's not XML, but based on your comments, you know that. If you cannot fix the source, in the spirit of "I know, I know, but what can I do?" see [my answer below](https://stackoverflow.com/a/59290597/290085). — kjhughes, Dec 11 '19 at 17:03
You might consider using TagSoup to parse this feed http://vrici.lojban.org/~cowan/XML/tagsoup/ — Mads Hansen, Dec 11 '19 at 17:25

score 3 · Accepted Answer · answered Dec 11 '19 at 16:55

As you're aware, the "XML" isn't XML due to the unescaped & outside of CDATA. Thus, you're stuck having to pre-process without the benefit of an XML parser to differentiate between CDATA and PCDATA. That's rough, and regex isn't up to the task for all the reasons that regex isn't up to parsing XML.

Here's one approach that can help:

1. Use regex to replace all isolated (not part of a character entity) & characters with &TEMP, including those within CDATA.
2. Using an XML parser on the now well-formed XML, restore the &TEMP occurrences within CDATA to &.
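
The two steps above can be sketched roughly as follows, assuming the JDK's built-in SAX parser that the asker mentions in the comments. The class name, the method name, and the exact placeholder are hypothetical. In particular, the placeholder is written here in escaped form (`&amp;TEMP;`) so that the intermediate document actually parses, which is an assumption about what &TEMP is meant to stand for. Outside CDATA the sketch drops the placeholder, following the question's wish to remove stray ampersands there.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

class TwoStepAmpersandFix {
    // Regex from the question: an & that does not start a known entity or a character reference.
    static final String ISOLATED_AMP = "&(?!(?:apos|quot|[gl]t|amp);|#)";
    // Assumed placeholder, written in escaped form so the intermediate document parses.
    static final String PLACEHOLDER = "&amp;TEMP;";

    /** Returns the character data, with stray & dropped outside CDATA and restored inside it. */
    static String extractText(String brokenXml) {
        // Step 1: make the input well-formed, touching CDATA as well.
        String wellFormed = brokenXml.replaceAll(ISOLATED_AMP, PLACEHOLDER);
        final StringBuilder text = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            private final StringBuilder segment = new StringBuilder();
            private boolean inCdata;

            private void flush() {
                String s = segment.toString();
                // CDATA arrives verbatim; elsewhere the parser has already decoded &amp; to &.
                text.append(inCdata ? s.replace(PLACEHOLDER, "&") : s.replace("&TEMP;", ""));
                segment.setLength(0);
            }
            @Override public void startCDATA() { flush(); inCdata = true; }
            @Override public void endCDATA() { flush(); inCdata = false; }
            @Override public void characters(char[] ch, int start, int length) {
                segment.append(ch, start, length);
            }
            @Override public void endDocument() { flush(); }
        };
        try {
            // Step 2: parse; the SAX lexical handler reports where CDATA starts and ends.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(wellFormed)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the pre-processed input", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<rss><title>a & b</title><title><![CDATA[c & d]]></title></rss>"));
    }
}
```

A real RSS reader would do its usual work in the handler callbacks instead of concatenating text. The only part that matters here is that the placeholder is treated differently depending on whether `startCDATA` has been seen.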

See also: How to parse invalid (bad / not well-formed) XML?

General advice on parsing messy "XML"
Tolerant parsers
Regex's for matching invalid characters and &'s

jrook · Answer 2 · 2019-12-11T18:09:05.267

As a complement to the answer by @kjhughes, writing a program to remove the ampersand characters is fairly straightforward, although a rather boring exercise. Since CDATA sections cannot be nested, it is easy to track where a section opens and closes.

Here is one such program:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    public class CdataAmpersandFilter {
        // Scanner states; each comment shows the input matched so far.
        static final int NOCDATA = -1;
        static final int OPEN_CDATA0 = 0;   // !
        static final int OPEN_CDATA1 = 1;   // ![
        static final int OPEN_CDATA2 = 2;   // ![C
        static final int OPEN_CDATA3 = 3;   // ![CD
        static final int OPEN_CDATA4 = 4;   // ![CDA
        static final int OPEN_CDATA5 = 5;   // ![CDAT
        static final int OPEN_CDATA6 = 6;   // ![CDATA
        static final int INSIDE_CDATA = 7;  // ![CDATA[
        static final int CLOSE_CDATA0 = 8;  // ] seen inside CDATA

        public static void main(String[] args) throws IOException {
            String xml = "<title>Very bad XML with & (unescaped)</title>\n" +
                    "<title>Good XML with &amp; and &#x3E; (escaped)</title>\n" +
                    "<title><![CDATA[ Good XML with & in CDATA && ]]></title><title>Very bad XML with ![CDATA[&]] && (unescaped)</title>";

            StringBuilder result = new StringBuilder();
            Reader reader = new BufferedReader(new StringReader(xml));

            int r;
            int state = NOCDATA;

            while ((r = reader.read()) != -1) {
                char c = (char) r;
                if (state == INSIDE_CDATA || state == CLOSE_CDATA0) {
                    // Inside CDATA only ] matters: two in a row close the section,
                    // while a single ] followed by anything else is ordinary content.
                    if (c != ']')
                        state = INSIDE_CDATA;
                    else
                        state = (state == INSIDE_CDATA) ? CLOSE_CDATA0 : NOCDATA;
                } else {
                    // Outside CDATA, follow the opening marker ![CDATA[ one character
                    // at a time; any character that does not continue it resets the match.
                    switch (c) {
                        case '!':
                            state = OPEN_CDATA0;
                            break;
                        case '[':
                            if (state == OPEN_CDATA0)
                                state = OPEN_CDATA1;
                            else if (state == OPEN_CDATA6)
                                state = INSIDE_CDATA;
                            else
                                state = NOCDATA;
                            break;
                        case 'C':
                            state = (state == OPEN_CDATA1) ? OPEN_CDATA2 : NOCDATA;
                            break;
                        case 'D':
                            state = (state == OPEN_CDATA2) ? OPEN_CDATA3 : NOCDATA;
                            break;
                        case 'A':
                            if (state == OPEN_CDATA3)
                                state = OPEN_CDATA4;
                            else if (state == OPEN_CDATA5)
                                state = OPEN_CDATA6;
                            else
                                state = NOCDATA;
                            break;
                        case 'T':
                            state = (state == OPEN_CDATA4) ? OPEN_CDATA5 : NOCDATA;
                            break;
                        default:
                            state = NOCDATA;
                            break;
                    }
                }
                // Drop every & unless it is inside a CDATA section.
                if (c != '&' || state == INSIDE_CDATA)
                    result.append(c);
            }
            System.out.println(result);
        }
    }

This program outputs the following for the input from the question (a copy of the first line, with an extra CDATA-like marker, has been appended to the end of the input to check the handling of closing brackets):

<title>Very bad XML with  (unescaped)</title>
<title>Good XML with amp; and #x3E; (escaped)</title>
<title><![CDATA[ Good XML with & in CDATA && ]]></title><title>Very bad XML with ![CDATA[&]]  (unescaped)</title>

It is essentially a simple state machine built with a switch/case statement. I have not tested it extensively, and I suspect nested CDATA sections could make it fail (though nesting is not allowed anyway). I also did not bother to check for the final > of the CDATA closing tag. Note that, as the output shows, it removes every & outside CDATA, including the ones that begin valid references such as &amp;. It should be easy to modify it to cover any failing cases. This answer provides the proper structure for the lexical analysis of CDATA tags.
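
For comparison, here is a minimal sketch of a regex-only variant of the same idea, assuming Java's `java.util.regex`. It matches a whole CDATA section as one alternative and copies it through unchanged, and applies the lookahead regex from the question to everything else, so valid references such as &amp; are kept. The class and method names are hypothetical, and the sketch does not handle comments, processing instructions, or unterminated CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdataAwareAmpersandStripper {
    // Alternative 1 captures a complete CDATA section (group 1);
    // alternative 2 is the regex from the question, matching a stray &.
    private static final Pattern TOKEN = Pattern.compile(
            "(<!\\[CDATA\\[.*?\\]\\]>)|&(?!(?:apos|quot|[gl]t|amp);|#)",
            Pattern.DOTALL);

    static String strip(String xml) {
        Matcher m = TOKEN.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cdata = m.group(1);
            // Keep a CDATA section verbatim; replace a stray & with nothing.
            m.appendReplacement(out, cdata == null ? "" : Matcher.quoteReplacement(cdata));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<title>Very bad XML with & (unescaped)</title>\n"
                + "<title>Good XML with &amp; and &#x3E; (escaped)</title>\n"
                + "<title><![CDATA[ Good XML with & in CDATA ]]></title>"));
    }
}
```

Because the CDATA alternative comes first and consumes the whole section, the second alternative never gets a chance to match an & inside it. This is still pattern matching on text that is not well-formed XML, so the caveats from the comments on the question apply.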
