Remove <> symbols from CDATA XML tag using regex

Question

Let's say I have an XML document like this:

<records>
    <record>
        <name>Jon</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
        </comment>
    </record>
    <record>
        <name>Jane</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 siblings ]]]>
        </comment>
    </record>
</records>

I need to convert this document to a JSON object using xml2js, but I need to remove the < and > symbols from the CDATA content so that they do not break the JSON conversion process.

What I have tried

Since I understand that I need to remove these symbols before passing the XML string to the xml2js parser, I have tried variations of the solutions described in the following cases:

I can match the entire contents of the CDATA tag, but I am not able to match the specific characters that I need to remove. This has to be accomplished in a single regex so I can pass the result to the XML to JSON parser.

Any help or pointers would be greatly appreciated. Thanks in advance.

Additional Info

Adding this since the question was voted down due to lack of research evidence.

I tried modifying a regex rule I found in one of the references I mentioned. This is the rule.

\[CDATA\[(.*?)\]\]>

This matches the entire contents of the CDATA tag. This is helpful, but what I need is to replace/remove content within the CDATA tags. Here is how it looks on the regex editor.

I then proceeded to modify the rule to match either < or >. Here is the rule that I tried.

\[CDATA\[(.*?)[<>]*\]\]>

This rule matches the following content (not just the <> signs).

    [ Patient with > 2 and < 5 siblings]

Here is how it looks on the regex editor.
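The behaviour of the two rules can be reproduced in plain JavaScript. This is a minimal sketch, assuming the first CDATA line of the sample document as input (the variable names are illustrative). The lazy group (.*?) keeps expanding until \]\]> can match, and [<>]* is allowed to match nothing, so both rules end up capturing the same text.

```javascript
// One CDATA line taken from the sample document above.
const sample = '<![CDATA[[ Patient with > 2 and < 5 siblings]]]>';

// First rule: capture everything between "[CDATA[" and "]]>".
const wholeRule = /\[CDATA\[(.*?)\]\]>/;

// Second rule: the added [<>]* may match zero characters, so it changes nothing.
const attemptedRule = /\[CDATA\[(.*?)[<>]*\]\]>/;

const m1 = sample.match(wholeRule);
const m2 = sample.match(attemptedRule);

console.log(m1[1]);           // text captured by the first rule
console.log(m2[1]);           // text captured by the second rule
console.log(m1[1] === m2[1]); // both rules capture the same text
```

A single match always covers a contiguous span, so one pattern of this shape cannot pick out only the < and > characters; that needs either a nested replace or a lookbehind.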

I hope this gives more clarity about what I am trying to accomplish.

Edit 2:

Here is the error triggered by the code. The relevant error message is invalid closing tag.

Here is line 38 of import.js as referenced in the error trace.

const jsonXml = await parseStringPromise(xml).then((res) => res);

This line uses xml2js to parse the XML document and convert it to a JSON object. Because the CDATA tag contains the <> symbols, I assume that the parser thinks they are part of an XML tag that is not closed properly.

In what way does your conversion break? Isn't `comment: ["\n [ Patient with > 2 and < 5 siblings]\n "]` the content you expect? — Martin Honnen, Sep 05 '21 at 20:28
Hey, I am in the process of editing the question to show what I have tried. I have considered two options: 1) Remove the <> symbols altogether 2) Convert them to HTML entities. The XML to JSON conversion breaks because the XML tags use these symbols. I am looking for ways to handle these cases. — ivan quintero, Sep 05 '21 at 20:58
Are you trying to process the XML document directly with regex? If so, why is this tagged `xslt`? -- P.S. Do not try to process the XML document directly with regex - see here why: https://stackoverflow.com/a/1732454 — michael.hor257k, Sep 05 '21 at 21:22
Please show your code using that library and the result you get versus the one you want or explain in which way the "conversion breaks", which error you get. — Martin Honnen, Sep 05 '21 at 21:41
Hello @michael.hor257k. I tagged the question as XSL because this XML file will be styled by an XSL stylesheet. However, the problem is not XSL-related. Apologies for that. Martin, I will edit the question with the requested info in a few minutes. — ivan quintero, Sep 05 '21 at 22:07
Finished editing the question with the requested data. I really appreciate your time in helping with my question. @michael.hor257k — ivan quintero, Sep 05 '21 at 23:04
If you are using PCRE: `(?:\G(?!^)|<!\[CDATA\[\[)(?:(?!<!\[CDATA\[\[|]]>)[^<>])*\K[<>]`, see [the regex demo](https://regex101.com/r/MCn44x/1/). — Wiktor Stribiżew, Sep 05 '21 at 23:12
You should address this to Martin Honnen. I know nothing about `xml2js`. The only thing I can tell you is that your input is a well-formed XML document and if your tool cannot handle it properly then your tool is broken. You might want to try and pre-process the XML by applying an XSLT *identity transform* to it; this will convert the CDATA sections to escaped text (i.e. `Patient with &gt; 2 and &lt; 5 siblings`) which your tool might be able to handle properly. But ultimately your best choice, IMHO, is to use another method to get the wanted result. Such as XSLT, for example. — michael.hor257k, Sep 05 '21 at 23:16
Thanks Michael. This process happens before the XSLT stylesheet comes into play. I will submit an issue on the `xml2js` repository. Thanks to all for the feedback — ivan quintero, Sep 05 '21 at 23:54
Hey @WiktorStribiżew, thanks for the solution. Unfortunately, this has to work with JavaScript (ECMAScript) — ivan quintero, Sep 06 '21 at 00:01
Well, that is still [easy](https://regex101.com/r/MCn44x/2/), use `(?<=<!\[CDATA\[\[(?:(?!<!\[CDATA\[\[|]]>).)*)[<>]`. — Wiktor Stribiżew, Sep 06 '21 at 07:04

score 0 · Answer 1 · answered Sep 06 '21 at 08:32

I can't reproduce the parsing problem (using the current version of the library, 0.4.23):

var xml2js = require("xml2js")

var xml = `<records>
    <record>
        <name>Jon</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
        </comment>
    </record>
    <record>
        <name>Jane</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 siblings ]]]>
        </comment>
    </record>
</records>`;

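// Top-level await is not available in a CommonJS script, so wrap the calls in an async function.
(async () => {
    const jsResult = await xml2js.parseStringPromise(xml);

    const jsonResult = JSON.stringify(jsResult);

    console.log(jsonResult);
})();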

That gives

{"records":{"record":[{"name":["Jon"],"surname":["Doe"],"dob":["2001-02-01"],"comment":["\n            [ Patient with > 2 and < 5 siblings]\n        "]},{"name":["Jane"],"surname":["Doe"],"dob":["2001-02-01"],"comment":["\n            [ Patient with > 2 siblings ]\n        "]}]}}

which validates and formats fine at jsonlint.com as

{
    "records": {
        "record": [{
            "name": ["Jon"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 and < 5 siblings]\n        "]
        }, {
            "name": ["Jane"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 siblings ]\n        "]
        }]
    }
}

Alternatively, you can use const jsonResult = JSON.stringify(jsResult, null, 4); which also gives readable, indented output along these lines:

{
    "records": {
        "record": [{
            "name": ["Jon"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 and < 5 siblings]\n        "]
        }, {
            "name": ["Jane"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 siblings ]\n        "]
        }]
    }
}
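To illustrate why the < and > characters are harmless on the JSON side, here is a small sketch that uses only the built-in JSON object. The record object is a hand-written, illustrative stand-in for part of the parser result, not actual xml2js output.

```javascript
// Hand-written stand-in for one parsed record (illustrative, not real xml2js output).
const record = {
    name: ["Jon"],
    comment: ["[ Patient with > 2 and < 5 siblings]"]
};

// Compact form: < and > need no escaping inside JSON strings.
const compact = JSON.stringify(record);

// Indented form: the third argument is the number of spaces per nesting level.
const pretty = JSON.stringify(record, null, 4);

// Parsing the JSON back returns the comment text unchanged.
const roundTripped = JSON.parse(pretty);

console.log(compact);
console.log(pretty);
console.log(roundTripped.comment[0] === record.comment[0]);
```

Note that with an indent argument, JSON.stringify puts every array element on its own line, so its output is taller than the jsonlint.com formatting shown above.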

Hello Martin. As noted by you previously, this issue has more to do with badly formed XML than an issue with the library. The example given is a contrived case created by me as I cannot put the entire XML message due to privacy concerns. Again, the xml2js library handles these cases as expected. This is not a library issue. — ivan quintero, Sep 07 '21 at 14:56

score 0 · Answer 2 · answered Sep 07 '21 at 10:17

Since JavaScript is the language you are coding in, you can use

const text = `<comment>
   <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`
const re = /\[CDATA\[\[[^]*?]]>/g
console.log( text.replace(re, (x) => x.replace(/[<>]/g, '')) )

The \[CDATA\[\[[^]*?]]> pattern (see its demo) matches all CDATA blocks, even if they span multiple lines, because

\[CDATA\[\[ matches [CDATA[[ substrings
[^]*? matches zero or more chars, as few as possible ([^] is a JavaScript construct that matches any char, including line breaks)
]]> matches ]]>.

Then, once the match is found, all < and > are removed from these matched texts with x.replace(/[<>]/g, '').
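If dropping the characters loses too much information, the same callback technique can escape them instead, which is close to what the identity transform suggested in the comments would produce. The following is only a sketch under assumptions: escapeXmlText and cdataToEscapedText are hypothetical helper names, not part of xml2js, and the CDATA wrapper is removed so that the parser decodes the entities back into the original characters.

```javascript
// Hypothetical helper (not part of xml2js): escape XML-special characters in text.
function escapeXmlText(s) {
  return s
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: replace each CDATA section with equivalent escaped text.
function cdataToEscapedText(xml) {
  // [^]*? lazily matches any characters, including line breaks, up to the first "]]>".
  return xml.replace(/<!\[CDATA\[([^]*?)\]\]>/g, (_match, body) => escapeXmlText(body));
}

const input = `<comment>
   <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`;

const escapedXml = cdataToEscapedText(input);
console.log(escapedXml);
```

Like any regex-based preprocessing of XML, this assumes the CDATA delimiters themselves are intact in the input.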

score 0 · Answer 3 · answered Sep 07 '21 at 19:22

I am providing an answer to the question using Wiktor Stribiżew's regex from the comments, in case this helps anyone with a similar problem.

(?<=<!\[CDATA\[\[(?:(?!<!\[CDATA\[\[|]]>).)*)[<>]

Thanks Wiktor
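For completeness, here is a minimal sketch of how the pattern above might be applied with String.prototype.replace. It assumes a JavaScript engine that supports lookbehind assertions (ES2018 or later), and the variable names are illustrative.

```javascript
// The lookbehind only accepts a < or > that is preceded, on the same line,
// by "<![CDATA[[" with no "]]>" or new "<![CDATA[[" starting in between.
const cdataAngleBrackets = /(?<=<!\[CDATA\[\[(?:(?!<!\[CDATA\[\[|]]>).)*)[<>]/g;

const sampleXml = `<comment>
   <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`;

// Remove the matched characters; tags and CDATA delimiters are left untouched.
const cleanedXml = sampleXml.replace(cdataAngleBrackets, '');

console.log(cleanedXml);
```

Because . does not match line breaks by default, this form only cleans the first line of a CDATA section; adding the s flag should extend it to multi-line sections.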

ivan quintero
