
I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.

Some CDATA sections have a line break after the opening XML tag but before the opening of the CDATA block. In these cases REXML parses the text of the tag as empty.

  • Any idea if I can get REXML to read these lines?
  • If not, could I rewrite them beforehand with a regex or something?
  • Is this even valid XML?

Here's an example XML document (much abridged):

<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>

    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>

    <content>
        Seems happy</content>
    <content>Obviously no problem</content>
</root-tag>

and here's my Ruby script (distilled down to a minimal example):

require 'rexml/document'
require 'base64'
include REXML

module RexmlSpike
  file = File.new("ex.xml")
  doc = Document.new file
  doc.elements.each("root-tag/content") do |contentElement|
    if contentElement.attributes["type"] == "base64"
      puts "decoded: " << Base64.decode64(contentElement.text)
    else
      puts "raw: " << contentElement.text
    end
  end
  puts "Finished."
end

The output I get is:

>> ruby spike.rb
  decoded: Well done! It works :)
  decoded:
  raw: This will work
  raw:

  raw:
          Seems happy
  raw: Obviously no problem
  Finished.

I'm using Ruby 1.9.3p392 on OS X Lion. The ultimate goal is to parse comments from some BlogML into the custom import XML used by Disqus.

Andrew M

3 Answers


Why

Element#text returns only the first text node of the element, so anything that comes before the <![CDATA[]]> is returned instead of whatever is in the <![CDATA[]]>. That can be a letter, a single space, or a newline (as you've discovered). This makes sense, because your example asks for the text of the element, and whitespace counts as text. In the examples where you are able to read the <![CDATA[]]>, it is because the CDATA section is the first child, and REXML treats a CData node as a kind of Text node.
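To see this for yourself, here is a minimal sketch that lists the child nodes of two of the <content> elements from the question. The variable name child_summaries is only for illustration, and it assumes REXML's default settings, which keep whitespace-only text nodes:

```ruby
require 'rexml/document'

xml = <<XML
<root-tag>
    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>
</root-tag>
XML

doc = REXML::Document.new(xml)

# For each <content>, collect the class name and value of every child node.
# (child_summaries is an illustrative name, not part of REXML.)
child_summaries = doc.elements.to_a("root-tag/content").map do |element|
  element.children.map { |node| [node.class.name, node.value] }
end

child_summaries.each { |summary| p summary }
```

The second element should show a whitespace-only REXML::Text node ahead of the REXML::CData node, and that first node is what text hands back.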


Solution

If you look at the documentation for Element, you'll see that it has a function called cdatas() that:

Get an array of all CData children. IMMUTABLE.

So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
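As a rough sketch of that idea, the following reads the CDATA children when there are any and falls back to text otherwise. The helper name content_value is made up for this example, and stripping the surrounding whitespace is my assumption about what you want, so drop the strip if the whitespace matters:

```ruby
require 'rexml/document'
require 'base64'

xml = <<XML
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>
    <content>
        <![CDATA[This will not appear]]></content>
    <content>
        Seems happy</content>
</root-tag>
XML

# Hypothetical helper: prefer the CDATA children, fall back to plain text.
def content_value(element)
  cdatas = element.cdatas
  raw = cdatas.empty? ? element.text.to_s : cdatas.map(&:value).join
  raw = raw.strip # assumption: surrounding whitespace is not significant
  element.attributes["type"] == "base64" ? Base64.decode64(raw) : raw
end

doc = REXML::Document.new(xml)
results = doc.elements.to_a("root-tag/content").map { |el| content_value(el) }
results.each { |line| puts line }
```

Joining all the CDATA children is a choice made here for simplicity. If an element can mix CDATA and ordinary text that both matter, you would need to walk element.children instead.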

lightswitch05

I'd recommend using Nokogiri, which is the de facto XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>

    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>

    <content>
        Seems happy</content>
    <content>Obviously no problem</content>
</root-tag>
EOT

doc.search('content').each do |n|
  puts n.content
end

Which outputs:

V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==

        VGhpcyB3b250IHdvcms=
This will work

        This will not appear

        Seems happy
Obviously no problem
the Tin Man
  • Thanks - I think I might give Nokogiri a try, it does sound like it's better, but it doesn't really answer the original question, so I'll leave it open to see if someone knows the answer. – Andrew M Aug 02 '13 at 06:40
  • That does not answer the question or properly explain why he should use Nokogiri instead of REXML – fotanus Aug 11 '13 at 00:39
  • He should use Nokogiri if he wants to parse the XML without the hassle that he's going through using REXML. Suggesting the OP use JSON instead of XML puts them even farther afield than Nokogiri vs. REXML. – the Tin Man Aug 11 '13 at 01:53

Your XML is valid, but it does not mean what you expect, as @lightswitch05 pointed out. You can check it with the W3C XML validator.
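If you'd rather check locally, a minimal sketch is to let REXML parse the document and see whether it raises. The helper name well_formed? is made up for this example, and it only checks that the document is well-formed, not that it is valid against a DTD or schema:

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML can parse the string without error.
# This checks well-formedness only, not validity against a DTD or schema.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

sample = <<XML
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content>
        <![CDATA[This will not appear]]></content>
</root-tag>
XML

puts well_formed?(sample)
puts well_formed?("<a><b></a>")
```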

If you are using XML from the wild world web, it is a good idea to use Nokogiri, because it usually works as you think it should, not as it strictly should.

Side note: this is exactly why I avoid XML and use JSON instead. XML has a proper definition, but no one seems to follow it anyway.

fotanus
  • Interesting comparison of Nokogiri and REXML. Unfortunately, when porting data between two third-party systems you can't choose what that data looks like; we just have to try and work with it. – Andrew M Aug 13 '13 at 08:24
  • @AndrewM Yes, many times I could not choose either, and that is how I realized it. Happy hacking. – fotanus Aug 13 '13 at 14:09