
Unicode string:

string = "CEO Frye \u2013 response to Capitalism discussion in Davos: Vote aggressively with your wallet against firms without social conscience."

I tried (via Is this the best way to unescape unicode escape sequences in Ruby?):

# Replace each literal backslash-u sequence (\uXXXX) with the character it names.
def unescape_unicode(s)
  s.gsub(/\\u([\da-fA-F]{4})/) { [$1].pack("H*").unpack("n*").pack("U*") }
end

unescape_unicode(string) #=> CEO Frye \u2013 response to Capitalism discussion in Davos: Vote aggressively with your wallet against firms without social conscience. 

But output (to file) is still identical to input! Any help would be appreciated.

Edit: I'm not using IRB but RubyMine, and the input is parsed from Twitter, hence the single "\u" rather than "\\u".

Edit 2: RubyMine IDE Output

Mr. Demetrius Michael
  • `"\u2013"` is a literal unicode character... did you mean `"\\u2013"`? – J. Holmes Feb 10 '12 at 16:00
  • You know what that's probably the problem with the gsub. It's looking for \\u, not \u... I'm not too sure how to fix :(. "\u2013" is what I parsed, it's not manual input. – Mr. Demetrius Michael Feb 10 '12 at 16:57
  • As far as I can tell, there is no problem with the regex or the `unescape_unicode` helper. There just isn't anything to unescape in the string you have provided (as it is defined in the question). The problem may be more in how you are writing this to a file than a problem with the string. – J. Holmes Feb 10 '12 at 17:03
  • I added images. You think it's the RubyMine IDE? – Mr. Demetrius Michael Feb 10 '12 at 17:12
  • There isn't anything wrong... You just have a misunderstanding of what `string = "\u2013"` means. See LBg's answer. – J. Holmes Feb 10 '12 at 17:15
  • I'm sorry, I'm just new to ruby. Is there any way to convert it so it can output unescaped? (See images above, how it outputs literally). When I run the exact same string through IRB, the string is human-readable. Any idea why there's a difference? – Mr. Demetrius Michael Feb 10 '12 at 17:20
  • http://stackoverflow.com/questions/1255324/p-vs-puts-in-ruby – J. Holmes Feb 10 '12 at 17:31

1 Answer


Are you trying it from irb, or outputting the string with p?

String#inspect (called by irb and by p str) can render Unicode characters in \uXXXX form, depending on the encodings in effect, so that the string can be printed anywhere. Also, when you type "CEO Frye \u2013 response to...", the \u2013 is an escape sequence that the Ruby parser resolves. The final string contains the actual Unicode character, not a backslash followed by digits.

str1 = "a\u2013b"
str1.size #=> 3
str2 = "a\\u2013b"
str2.size #=> 8
unescape_unicode(str2) == str1 #=> true
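
To tell which of the two cases you actually have, it can help to look at the string's size and bytes rather than at how it is displayed. The following is a minimal sketch, not the asker's real code: the names `parsed` and `literal` are made up for illustration, and `unescape_unicode` is the helper from the question. `String#dump` is used here instead of `inspect` because it always escapes non-ASCII characters, whereas what `inspect` shows depends on the encodings in effect.

```ruby
# Helper from the question (behaviour unchanged).
def unescape_unicode(s)
  s.gsub(/\\u([\da-fA-F]{4})/) { [$1].pack("H*").unpack("n*").pack("U*") }
end

# Hypothetical sample strings, chosen only for illustration.
parsed  = "a\u2013b"   # double quotes: the parser resolves \u2013 to one character
literal = 'a\u2013b'   # single quotes: backslash, "u", "2013" stay as six characters

# Size and bytes reveal what is really stored.
parsed_size   = parsed.size           # 3
literal_size  = literal.size          # 8
parsed_bytes  = parsed.bytes          # [97, 226, 128, 147, 98] (U+2013 is E2 80 93 in UTF-8)
has_backslash = literal.include?('\u')

# The helper only changes strings that contain a literal backslash sequence.
same_after   = unescape_unicode(parsed) == parsed     # nothing to unescape
fixed_after  = unescape_unicode(literal) == parsed    # literal sequence converted

# dump produces the escaped, quoted form regardless of terminal encoding.
dumped = parsed.dump
```

If the file ends up containing a backslash followed by u2013, the string was probably written in its inspected form (for example with `p`) or already held a literal sequence. Writing `parsed` with `puts` or `File.write` should store the character itself.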
Guilherme Bernal
  • I'll edit the question. When I write the string to file (or p string), it writes it unicode escaped. Not using IRB, using RubyMine IDE. The string is grabbed from twitter, not manually entered too. – Mr. Demetrius Michael Feb 10 '12 at 16:54