How to remove CDATA from an XML file using sed (Linux)

Question

I am trying to remove the following patterns from an xml file:

<![CDATA[
]]>

For this purpose I used the following sed command from "Remove CDATA tags from XML file":

sed -e 's/<![CDATA[//g' | sed -e 's/]]>//g' file.xml

The problem is that the patterns are not being matched. The command prints the whole text again with the patterns still in it.

<text>
<![CDATA[
ethnic minority communities have been in Belfast since the 1930s.]]>
<\text>

Previous questions

How to remove CDATA from my xml parser? (uses Java)
how to remove CDATA from xml in java code (uses Java)

Does it have to be `sed`? The reason I ask is that regular expressions are NOT good tools for handling XML. They're at best dirty hacks. But this also raises the question: what are you trying to accomplish here? Can you give a sample input and the desired output? (Starting with valid XML is good.) — Sobrique, Sep 29 '15 at 15:21
@Sobrique I don't want to use tools like XML-Twig or Python's XML modules because my text contains symbols like & that will cause an error when I use them. This is why I went with sed or grep. I am wrong, right? I guess. — Hani Goc, Sep 29 '15 at 15:22
If it causes an error, your XML is broken and you should reject it. You absolutely shouldn't try and 'fix' broken XML, for all the reasons you wouldn't try and 'fix' broken program code with another program. — Sobrique, Sep 29 '15 at 15:23
Well, because I tried them many times. If I have a weird symbol in the text, it's game over and I have to do everything again. The structure is OK. It's the weird symbols like &. — Hani Goc, Sep 29 '15 at 15:26
http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean — Sobrique, Sep 29 '15 at 15:31
ohhhh sheesh well. @Sobrique what can I use for that? I was starting to write a python script to extract them — Hani Goc, Sep 29 '15 at 15:38
Is that `<\text>` a typo for `</text>`? Or is your input really that broken? — Toby Speight, Jan 10 '18 at 16:57

score 3 · Accepted Answer · answered Sep 29 '15 at 18:17

I suggest the versatile XmlStarlet tool. To remove the CDATA section and leave only the text content, use this command:

xml fo --omit-decl --nocdata file.xml

Output:

<text>
ethnic minority communities have been in Belfast since the 1930s.
</text>

When removing the CDATA section (which itself is an escape mechanism), XmlStarlet automatically escapes ampersands (&) which have a special meaning in XML. An input document like this,

<text>
<![CDATA[
ethnic minorities & communities have been in Belfast since the 1930s.]]>
</text>

would result in this output:

<text>
ethnic minorities &amp; communities have been in Belfast since the 1930s.
</text>
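
For comparison, the escaping step can be roughly approximated in plain sed. This is only a sketch under loud assumptions, not a substitute for XmlStarlet: it assumes GNU sed, it assumes the `<![CDATA[` and `]]>` markers sit on different lines (as in the sample above), and it escapes only `&`, so a literal `<` inside the CDATA section would still break the result. The temporary file and the `escaped` variable are illustrative names.

```shell
# Sample input taken from the answer above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<text>
<![CDATA[
ethnic minorities & communities have been in Belfast since the 1930s.]]>
</text>
EOF

# 1st expression: on the lines from the opening marker through the closing
#                 marker, turn every & into &amp; (\& is a literal ampersand
#                 in a sed replacement).
# 2nd expression: strip the CDATA markers themselves.
# Assumption: the two markers are on different lines; a CDATA section that
# opens and closes on one line would make the range run on too far.
escaped=$(sed -e '/<!\[CDATA\[/,/\]\]>/ s/&/\&amp;/g' \
              -e 's/<!\[CDATA\[//g; s/\]\]>//g' "$tmp")
printf '%s\n' "$escaped"
rm -f "$tmp"
```

The line that held only the opening marker is left behind as an empty line, which an XML-aware tool would handle more gracefully.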

On Debian-flavored Linux derivatives, the command is available as `xmlstarlet` and can be installed via `apt-get install xmlstarlet`. — k0pernikus, Jun 28 '17 at 15:45

score 0 · Answer 2 · answered Sep 29 '15 at 16:24


xml_grep --text_only 'text' input.xml > output.txt

where `text` is the name of the XML element.

Hani Goc


score 0 · Answer 3 · answered Jan 10 '18 at 14:26

I am answering the original question because I landed here and could not find an answer to it.

You need to escape the opening square brackets in the expression, because an unescaped `[` opens a character class (bracket expression). The closing brackets in the `]]>` pattern do not strictly need escaping, because no character class is open at that point in the regex, but escaping them keeps the expression consistent.

By the way, you can have sed apply several substitutions in one call by separating them with semicolons in the expression:

sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' file.xml
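
As a quick check, the command can be run against the sample from the question. This is a minimal sketch assuming GNU sed and a well-formed `</text>` closing tag. The temporary file and the `result` variable are illustrative names, not part of the original post.

```shell
# Sample input modelled on the question (closing tag written as </text>).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<text>
<![CDATA[
ethnic minority communities have been in Belfast since the 1930s.]]>
</text>
EOF

# Both substitutions run in one sed process that reads the file.
# In the question's pipeline only the second sed was given file.xml,
# so the first substitution never saw the file's contents.
result=$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The line that contained only `<![CDATA[` remains as an empty line. Note that any `&` or `<` inside the former CDATA section is left unescaped, so the result is only well-formed XML if the text contains no such characters.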
