1

Hi, I found this Apache Commons Lang method really useful:

StringUtils.substringBetween(fileContent, "<![CDATA[", "]]>") 

for extracting the content of the CDATA section in a document like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<envelope>
    <xxxx>
        <yyyy>
            <![CDATA[

                    <?xml version="1.0" encoding="UTF-8" ?>
                    <Document >
                        <eee>
                            <tt>
                                <ss>zzzzzzz</ss>
                                <aa>2021-09-09T10:39:29.850Z</aa>
                                <aaaa>
                                    <Cd>cccc</Cd>
                                </aaaa>
                                <dd>ssss</dd>
                                <ff></ff>
                            </tt>
                        </eee>
                    </Document>
                ]]>
        </yyyy>
    </xxxx>
</envelope>

But now I'm looking for another method or regex that lets me replace a dynamic XML

<![CDATA["old_xml"]]>

with another XML

<![CDATA["new_xml"]]>

Any idea how to accomplish this?

Regards.

Sotirios Delimanolis
paul
  • This works great... until you have an XML with two CDATA sections, one after the other. As has been discussed at [***great and passionate length***](https://stackoverflow.com/a/1732454/18157) on this site over the past decade, regex is categorically the WRONG tool for working with arbitrary XML, HTML, JSON, etc. You need a real parser for whatever flavor you're dealing with. _"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." -- [Jamie Zawinski](https://en.wikipedia.org/wiki/Jamie_Zawinski)_ – Jim Garrison Oct 30 '21 at 00:09
  • 100% agree, but I am only extracting the "text" from the CDATA; once I have extracted the text, I use a DOM parser – paul Oct 30 '21 at 09:22
  • If you insist on using regex, be prepared for it to break when you least expect it. Also prepare to be cursed by whoever has to maintain it. – Jim Garrison Oct 31 '21 at 00:25
  • The idea has finally been rejected XD – paul Oct 31 '21 at 00:26
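
For reference, the parser-based route suggested in the comments above could look roughly like the sketch below. It uses only the JDK's built-in DOM and identity-transform APIs; the class name `CdataDomReplacer`, the method `replaceCdata`, and the `elementName` parameter are hypothetical names chosen for illustration and do not come from the question. It assumes the CDATA section is a direct child of the named element, as in the sample envelope.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: replaces CDATA content via a DOM parser instead of a regex.
final class CdataDomReplacer {

    // Sets the content of every CDATA section that is a direct child of <elementName>.
    static String replaceCdata(String xml, String elementName, String newBody) {
        try {
            // The default factory is non-coalescing, so CDATA sections stay separate nodes.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList elements = doc.getElementsByTagName(elementName);
            for (int i = 0; i < elements.getLength(); i++) {
                NodeList children = elements.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                        child.setNodeValue(newBody); // replaces the CDATA text only
                    }
                }
            }

            // Identity transform: serialize the modified DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite CDATA section", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<envelope><yyyy><![CDATA[<Document>old</Document>]]></yyyy></envelope>";
        System.out.println(replaceCdata(xml, "yyyy", "<Document>new</Document>"));
    }
}
```

Because the parser identifies each CDATA node individually, two adjacent CDATA sections are not a problem here, at the cost of re-serializing the whole document.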

2 Answers

1

Instead of StringUtils, you can use the String#replaceAll method:

fileContent = fileContent
  .replaceAll("(?s)(<!\\[CDATA\\[).+?(]]>)", "$1foo$2");

Explanation:

  • (?s): Enable DOTALL mode so that . can also match line breaks in .+?
  • (<!\\[CDATA\\[): Match the opening <![CDATA[ substring and capture it in group #1
  • .+?: Match 1 or more of any character, including line breaks, as few times as possible (lazy)
  • (]]>): Match the closing ]]> substring and capture it in group #2
  • $1foo$2: Replace with foo, surrounded by back-references to capture groups 1 and 2
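
Since the replacement in the question is itself dynamic XML rather than a fixed `foo`, one thing to watch is that `replaceAll` treats `$` and `\` in the replacement string specially. A minimal sketch of how that might be handled with `Matcher.quoteReplacement` follows; the class name `CdataReplacer` and method `replaceCdata` are hypothetical, and the sketch uses `.*?` instead of `.+?` so that an empty CDATA section is also matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex approach with a dynamic replacement.
final class CdataReplacer {

    // (?s) enables DOTALL; the lazy .*? stops at the first closing ]]> it finds.
    private static final Pattern CDATA =
            Pattern.compile("(?s)(<!\\[CDATA\\[).*?(]]>)");

    // Replaces the body of every CDATA section in xml with newBody, taken literally.
    static String replaceCdata(String xml, String newBody) {
        // quoteReplacement escapes '$' and '\' so they are not read as group references.
        String replacement = "$1" + Matcher.quoteReplacement(newBody) + "$2";
        return CDATA.matcher(xml).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String xml = "<yyyy><![CDATA[\n<Document><ss>old</ss></Document>\n]]></yyyy>";
        System.out.println(replaceCdata(xml, "<Document><price>$10</price></Document>"));
    }
}
```

Note that this replaces every CDATA section in the input with the same body, and it does not guard against a new body that itself contains `]]>`.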
anubhava
  • @paul: Answer is updated. Had you provided this use-case data upfront we would have got to this answer in first go itself. – anubhava Oct 30 '21 at 04:59
  • `.replaceAll("(?s)(<!\\[CDATA\\[).+?(]]>)", "$1foo$2")` -- what is `(?s)`? Did you mean `(?:\\s)`? – Jim Garrison Oct 31 '21 at 00:21
  • @JimGarrison: `(?s)` is for enabling `DOTALL` mode so that `.` can match line breaks as well in `.+?` – anubhava Oct 31 '21 at 03:33
1

You can use the regex, (\<!\[CDATA\[).*?(\]\]>).

Demo:

public class Main {
    public static void main(String[] args) {
        String xml = """
                ...
                    <data><![CDATA[a < b]]></data>
                ...
                """;

        String replacement = "foo";

        xml = xml.replaceAll("(\\<!\\[CDATA\\[).*?(\\]\\]>)", "$1" + replacement + "$2");

        System.out.println(xml);
    }
}

Output:

...
    <data><![CDATA[foo]]></data>
...

Explanation of the regex:

  • ( : Start of group#1
    • \<!\[CDATA\[ : String <![CDATA[
  • ) : End of group#1
  • .*? : Any character (except line terminators, unless DOTALL mode is enabled with (?s)), as few times as possible
  • ( : Start of group#2
    • \]\]>: String ]]>
  • ) : End of group#2
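
The demo above uses a single-line CDATA section. For a CDATA section that spans several lines, as in the question, the same pattern needs DOTALL mode. The sketch below contrasts the two cases; the class name `DotallDemo` and its method names are hypothetical.

```java
// Hypothetical demo: the same pattern with and without the (?s) DOTALL flag.
final class DotallDemo {

    // Without (?s), '.' does not match line terminators.
    static String replaceWithoutDotall(String xml, String body) {
        return xml.replaceAll("(<!\\[CDATA\\[).*?(\\]\\]>)", "$1" + body + "$2");
    }

    // With (?s), '.' also matches line terminators.
    static String replaceWithDotall(String xml, String body) {
        return xml.replaceAll("(?s)(<!\\[CDATA\\[).*?(\\]\\]>)", "$1" + body + "$2");
    }

    public static void main(String[] args) {
        String multiLine = "<data><![CDATA[\na < b\n]]></data>";
        // Input comes back unchanged because the pattern cannot cross the line breaks.
        System.out.println(replaceWithoutDotall(multiLine, "foo"));
        // The whole multi-line body is replaced.
        System.out.println(replaceWithDotall(multiLine, "foo"));
    }
}
```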
Arvind Kumar Avinash