I'm trying to parse a big XML file and extract the raw inner content of every tag. Given something like this:

<string name="key"><![CDATA[Hey I'm a tag with & and other characters]]></string>

I want to get this:

<![CDATA[Hey I'm a tag with & and other characters]]>

However, when I use Nokogiri's SAX XML parser, I only get the text, without the CDATA markers and with the characters escaped, like this:

Hey I\'m a tag with &amp; and other characters

This is my code:

  require 'nokogiri'

  class IDCollector < Nokogiri::XML::SAX::Document
    def characters(string)
      puts string # this does not work: the CDATA markers are not printed
    end

    def cdata_block(string)
      puts string
      puts "<![CDATA[" + string + "]]>"
    end
  end

Is there any way to do this with Nokogiri SAX?

Luis Pereira
  • It's not exactly clear what you're trying to do: read or generate the CDATA block? You won't get `<![CDATA[Hey I'm a tag with & and other characters]]>` because it's a block, not a tag or element. `<![CDATA[` is effectively the tag but it's processed out and only its content is returned. http://stackoverflow.com/q/2784183 might help. I can't duplicate getting the encoded result. – the Tin Man Feb 27 '17 at 19:29
  • My final goal is to port some XML tags with their inner content to other files. However, both files are large, so I have to use SAX or else I get a memory exception. – Luis Pereira Feb 28 '17 at 08:52

2 Answers

It's not clear what you're trying to do, but this might help clear things up.

A <![CDATA[...]]> entry isn't a tag, it's a block, and is treated differently by the parser. When the block is encountered the <![CDATA[ and ]]> are stripped off so you'll only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.

If you're trying to create a CDATA block in XML it can be done easily using:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string') << Nokogiri::XML::CDATA.new(Nokogiri::XML::Document.new, "Hey I'm a tag with & and other characters")
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"

<< is just shorthand for add_child, which appends the node as a child.

Trying to use inner_html doesn't do what you want as it creates a text node as a child:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
doc.at('string').children.first.class # => Nokogiri::XML::Text

Using inner_html causes the string to be entity-encoded (& becomes &amp;), which is the alternative way of embedding text that could include tags. Without either that encoding or a CDATA block, an XML parser can't tell what is text and what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded embedded HTML in a feed is a pain.
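
To make the two options concrete, here is a small sketch in plain Ruby, with no Nokogiri involved. The helper names `escape_text` and `wrap_cdata` are hypothetical, not part of any library. One detail worth noting is that a CDATA section cannot contain a literal `]]>`, so the wrapper splits it across two adjacent sections:

```ruby
# Hypothetical helpers showing the two ways to embed arbitrary text in XML.

# Option 1: entity-encode the text so it can sit directly inside an element.
# String#encode with xml: :text replaces &, < and >.
def escape_text(text)
  text.encode(xml: :text)
end

# Option 2: wrap the text in a CDATA section. A literal "]]>" would end the
# section early, so it is split across two adjacent sections.
def wrap_cdata(text)
  "<![CDATA[#{text.gsub(']]>', ']]]]><![CDATA[>')}]]>"
end

sample = "Hey I'm a tag with & and other characters"
puts escape_text(sample) # Hey I'm a tag with &amp; and other characters
puts wrap_cdata(sample)  # <![CDATA[Hey I'm a tag with & and other characters]]>
```

Either form parses back to the same text; the difference is only in how the file looks on disk.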

the Tin Man
  • Actually, I prefer this (Nokogiri::XML::CDATA.new) to what I answered. Also, thanks for the detailed answer, it helped :) – Luis Pereira Feb 28 '17 at 08:56

After spending a while checking the documentation, I think this is only possible by building a new CDATA node with the help of Nokogiri, something like this:

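  # Build the CDATA node from a throwaway document
  tmp = Nokogiri::XML::Document.new
  value = tmp.create_cdata(value)
  # PATH_TO_REPLACE stands for the XPath of the element to update
  r = doc.at_xpath(PATH_TO_REPLACE)
  r.inner_html = value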
Luis Pereira