I am working with some XML data and I am stuck trying to remove the CDATA sections from it. I tried many approaches, and the simplest seems to be replacing every occurrence of a pattern like

hey <![CDATA[mate - number 1]]> what's up

by

hey mate - number 1 what's up

The regex that matches the whole expression is (\<\!\[CDATA\[)(.*)(\]\]\>), so with Perl-style (PCRE) syntax I would just need to replace it with \2.

Based on this, and taking advantage of PowerShell, I am running the following in CMD:

powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '\2' | Out-File Desktop\test_out.xml"

However, the result is that every match is replaced by the literal string \2 instead of mate - number 1 as in the example.

Instead of \2, I also tried (?<=(\<\!\[CDATA\[))(.*?)(?=(\]\]\>)) as the replacement, since that pattern matches the inner part I am trying to keep, but the result is frustrating: again it is inserted literally.

Any guess?

Thank you!

PS. If anyone knows how to handle this replacement in R, that would be useful as well.

2 Answers

An XSLT identity transform, which copies every node of the input to the output unchanged, will remove the CDATA wrappers, because the serializer writes their content as ordinary text. Consider running one with R's xslt package or with PowerShell:

library(xml2)
library(xslt)

txt <- "<root>
              <data>hey <![CDATA[mate - number 1]]> what's up</data>
       </root>"    
doc <- read_xml(txt)

txt <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
            <xsl:output indent="yes"/>
            <xsl:strip-space elements="*"/>

            <xsl:template match="@*|node()">
              <xsl:copy>
                 <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>

         </xsl:stylesheet>'    
style <- read_xml(txt)

new_xml <- xml_xslt(doc, style)

# Output
cat(as.character(new_xml))

# <?xml version="1.0" encoding="UTF-8"?>
# <root>
#    <data>hey mate - number 1 what's up</data>
# </root>

PowerShell:

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;

$xslt.Load("C:\Path\To\Identity_Transform\Script.xsl");
$xslt.Transform("C:\Path\To\Input.xml", "C:\Path\To\Output.xml");
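The PowerShell snippet above expects the stylesheet and the input in separate files. As a minimal self-contained sketch of the same idea, the identity stylesheet and a sample document can be held in here-strings and transformed in memory. The variable names and the sample XML are illustrative assumptions, not part of the original answer:

```powershell
# Identity-transform stylesheet (same template as in the R example)
$xsl = @'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
'@

# Sample input containing a CDATA section (assumed for illustration)
$xml = @'
<root><data>hey <![CDATA[mate - number 1]]> what's up</data></root>
'@

# Compile the stylesheet from the string instead of a file path
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load([System.Xml.XmlReader]::Create((New-Object System.IO.StringReader $xsl)))

# Transform the input string and capture the output in a StringWriter
$inputReader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader $xml))
$sw = New-Object System.IO.StringWriter
$writer = [System.Xml.XmlWriter]::Create($sw, $xslt.OutputSettings)
$xslt.Transform($inputReader, $writer)
$writer.Close()

# The CDATA content should now appear as ordinary text
$result = $sw.ToString()
$result
```

For real files, the two-argument Transform call with input and output paths shown above remains the shorter route; the in-memory form is mainly convenient for trying the stylesheet out.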
Parfait

In PowerShell, capture groups are referenced in the replacement string as $1, $2, etc. The -replace operator uses .NET regular expressions, which use this $ notation instead of the \1, \2 backslash notation found in many other tools.

I am on mobile at the moment, so I can't test this and may be off, but I believe this will do what you need:

 powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '$2' | Out-File Desktop\test_out.xml"

You can also create named capture groups if you like:

 powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(?<CData>.*)(\]\]\>)', '${CData}' | Out-File Desktop\test_out.xml"
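
To see how the replacement string is interpreted, the different forms can be compared on a single line in a PowerShell session. This is a small sketch: the sample string, the variable names, and the simplified pattern with a lazy .*? are assumptions made for illustration:

```powershell
$line    = "hey <![CDATA[mate - number 1]]> what's up"
$pattern = '<!\[CDATA\[(.*?)\]\]>'

# '\1' is not a group reference in a .NET replacement string, so it is inserted literally
$literal = $line -replace $pattern, '\1'

# '$1' in single quotes reaches the regex engine intact and refers to the first group
$numbered = $line -replace $pattern, '$1'

# A named group is referenced as '${name}'
$named = $line -replace '<!\[CDATA\[(?<inner>.*?)\]\]>', '${inner}'

# "$1" in double quotes is expanded by PowerShell as a variable before the regex
# engine sees it, so the group content is lost if no such variable is defined
$expanded = $line -replace $pattern, "$1"

$literal
$numbered
$named
$expanded
```

Note that Get-Content returns one string per line, so a CDATA section that spans several lines would not match; reading the file with Get-Content -Raw and adding the (?s) option to the pattern is one way around that.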
Ben Personick