So I've got this UTF-8 string in an XML file:

Horrible place. ☠☠☠

And when I feed it to an external application, the funny characters come back escaped as XML entities:

Horrible place. &#x2620;&#x2620;&#x2620;

In Ruby, how do I convert that string back to UTF-8? There's probably a really easy solution for this, but I can't find anything in the standard libraries; e.g. CGI.unescapeHTML (which works nicely for things like &gt;) seems to ignore numeric entities completely.

ree-1.8.7-2010.02 > CGI.unescapeHTML('&gt;')
 => ">"
ree-1.8.7-2010.02 > CGI.unescapeHTML('&#x2620;')
 => "&#x2620;"
lambshaanxy

2 Answers

Well, since it's XML encoded I'd go for an XML parser:

require 'nokogiri'

frag = 'Horrible place. &#x2620;&#x2620;&#x2620;'
doc = Nokogiri::XML.fragment(frag)
puts doc.text
# >> Horrible place. ☠☠☠
the Tin Man

CGI.unescapeHTML works just fine for me; the console you are using is probably unable to display the Unicode character.

Try this and it should work fine:

require 'cgi'

# Write the unescaped string to a file to rule out console display problems.
File.open("d:\\11.txt", 'w') {|f| f.write(CGI.unescapeHTML('&#x2620;')) } # => ☠
Zabba
  • Doesn't work for me, the file says "&#x2620;" and so does splitting the output string into bytes: `CGI.unescapeHTML('&#x2620;').bytes.each {|b| print "#{b} "} => 38 35 120 50 54 50 48 59`. This is in Rails 2.3, what version are you using? – lambshaanxy Dec 30 '10 at 02:56
  • That's weird. I used ruby 1.8.7 (2010-08-16 patchlevel 302) [i386-mingw32]. – Zabba Dec 30 '10 at 03:03
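
Since the comments show CGI.unescapeHTML giving different results on different setups, one dependency-free fallback is to decode the numeric character references by hand. The following is only a minimal sketch, not something taken from either answer: `decode_numeric_entities` is a hypothetical helper name, it assumes the input is a UTF-8 string, and it only handles numeric references (decimal and hexadecimal), leaving named entities such as &gt; untouched. A reference outside the valid code point range would raise an error rather than be skipped.

```ruby
# Hypothetical helper (not part of any library): replaces numeric character
# references such as &#x2620; (hex) or &#9760; (decimal) with the UTF-8
# encoding of the code point they name.
def decode_numeric_entities(str)
  str.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do
    # $1 is set for the hexadecimal form, $2 for the decimal form.
    codepoint = $1 ? $1.hex : $2.to_i
    # Array#pack with the 'U' directive emits the code point as UTF-8.
    [codepoint].pack('U')
  end
end

puts decode_numeric_entities('Horrible place. &#x2620;&#x2620;&#x2620;')
# prints: Horrible place. ☠☠☠
```

If the input can also contain named entities or nested markup, a real XML parser such as the Nokogiri approach above is probably the safer choice.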