Convert Ruby string with ampersand-hash-char-semicolon characters into an ascii or html friendly string

Question

Using Rails 3 I am consuming an XML feed generated by Drupal or something similar. The tags it gives me look like:

<body><![CDATA[&#60;p&#62;This is a title&#60;br /&#62;A subheading&#60;/p&#62;]]></body>

So the intention is that this should really look like:

<p>This is a title<br />A subheading</p>

Which could then be rendered in a view using <%= @mystring.html_safe %> or <%= raw @mystring %> or something. The trouble is that rendering the string this way simply converts substrings like &#60; into the < character on the page, rather than producing tags. I need a sort of double raw or double unencode to first decode the character references and then render the tags as HTML safe.

Anyone know of anything like:

<%= @my_double_safed_string.html_safe.html_safe %>

It looks like raw and html_safe don't encode or decode strings, they simply mark them as safe to include in your document - it's a protection mechanism against cross site scripting bugs. — Blixxy, May 09 '12 at 23:06
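
To make the distinction in that comment concrete, here is a minimal sketch using only Ruby's standard library (no Rails loaded). ERB::Util.html_escape stands in, as an assumption, for the escaping a Rails view applies to a string that is not marked html_safe; raw_from_feed and the other variable names are made up for illustration.

```ruby
require 'cgi'
require 'erb'

# Text as it comes out of the CDATA section (hypothetical sample).
raw_from_feed = '&#60;p&#62;Hi&#60;/p&#62;'

# Roughly what a view emits for a string NOT marked html_safe:
# the ampersands are escaped again, so the browser shows "&#60;p&#62;..." literally.
escaped_output = ERB::Util.html_escape(raw_from_feed)

# Marking a string html_safe skips that escaping but changes nothing in the
# string itself, so the browser still only shows a literal "<p>" as text.
marked_safe_output = raw_from_feed

# Decoding is a separate step: turn the character references into characters.
decoded = CGI.unescapeHTML(raw_from_feed)

puts escaped_output
puts marked_safe_output
puts decoded
```

In a view, the decoded string would still have to be marked with html_safe (or raw) before its tags render, which means trusting the feed's markup.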

score 6 · Accepted Answer · answered May 09 '12 at 22:52

I don't think this is valid XML: they've escaped the text twice in two different ways, using entities and CDATA. Still, you can parse it, using Nokogiri for example:

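require 'nokogiri'

# First parse: unwraps the CDATA section, leaving the entity-encoded text
xml = Nokogiri::XML.parse "<body><![CDATA[&#60;p&#62;This is a title&#60;br /&#62;A subheading&#60;/p&#62;]]></body>"
# Second parse: wrap that text in a dummy element so the parser decodes the entities
text = Nokogiri::XML.parse("<e>#{xml.text}</e>").text
#=> text = "<p>This is a title<br />A subheading</p>"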

Seeing as this Drupal site is spewing crazy double-escaped XML, I'd even be inclined to use a regexp. Hacks to solve a problem hacks created? IDK. Regardless:

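xml.text
#=> "&#60;p&#62;This is a title&#60;br /&#62;A subheading&#60;/p&#62;"
# chr(Encoding::UTF_8) also handles code points above 255, where a bare chr can raise RangeError
xml.text.gsub(/&#([0-9]+);/) { $1.to_i.chr(Encoding::UTF_8) }
#=> "<p>This is a title<br />A subheading</p>"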

Hope this helps!
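
If pulling in an XML parser is not an option, the regexp idea can be stretched a little to cover hexadecimal references such as &#x3C; as well. This is only a sketch, not a vetted implementation: decode_numeric_refs and NUMERIC_REF are hypothetical names, and the code assumes the input is a UTF-8 string.

```ruby
# Matches only numeric character references: hex (&#x3C;) or decimal (&#60;).
# Named entities such as &amp; are deliberately left alone.
NUMERIC_REF = /&#(?:x([0-9a-fA-F]+)|([0-9]+));/

# Hypothetical helper, for illustration only.
def decode_numeric_refs(text)
  text.gsub(NUMERIC_REF) do |match|
    # Group 1 is set for hex references, group 2 for decimal ones.
    codepoint = $1 ? $1.hex : $2.to_i
    begin
      codepoint.chr(Encoding::UTF_8)
    rescue RangeError
      match # not a valid code point: keep the reference as it was
    end
  end
end

puts decode_numeric_refs('&#60;p&#62;This is a title&#x3C;br /&#x3E;A subheading&#60;/p&#62;')
```

Keeping the pattern this narrow means the regexp never has to understand XML structure; it only rewrites self-contained tokens, and anything it cannot decode is passed through untouched.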

Blixxy


Yea the use of entities inside CDATA is bewildering and kind of defeats the purpose of using CDATA. – Andrew Marshall May 09 '12 at 22:54
FYI, if this Drupal feed is an RSS or Atom feed, you should have a look at http://ruby-doc.org/stdlib-1.9.3/libdoc/rss/rdoc/index.html - it's a feed parser/generator library included with Ruby - you could combine it with the gsub solution in this answer to avoid the Nokogiri dependency, if you're into that sort of thing. Nokogiri is pretty great though! – Blixxy May 09 '12 at 23:03
Technically it's valid, but completely unnecessary. – Mark Thomas May 10 '12 at 01:03
Valid as in "won't crash the parser" or valid as in correctly expresses the intent of a string containing html as the text body of the element? If this is supposed to parse in the more useful way, maybe time to bug fix nokogiri! – Blixxy May 10 '12 at 04:23
Awesome guys, I was thinking of taking a crack at a regex hack myself, but a double Nokogiri parse is probably the way to go this time. Thanks. – genkilabs May 10 '12 at 14:30
Absolutely. Regular Expressions are generally the wrong way to solve a problem involving irregular languages like XML - that is why I only used it on the entities. Best to use a real xml parser if you can. :) – Blixxy May 10 '12 at 22:41
