Convert unicode mess to correct characters in Ruby?

Question

I have a string such as:

"MÃ\u0083Â¼LLER".encoding
#<Encoding:UTF-8>   

"MÃ\u0083Â¼LLER".inspect    
"\"MÃ\\u0083Â¼LLER\""

What can I do to salvage such a string? Take into consideration I do not have the original data. Is this salvageable?

Hi, thanks for the interest. As I stated, I do not have the original data, and so have no idea. That is my problem, and I'm hoping there is a solution. The only information on the original data I have is that it was part of a php serialised object. — Damien Roche, Jun 11 '13 at 11:13

score 4 · Accepted Answer · answered Jun 11 '13 at 11:33

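It looks like the string was double-encoded: its UTF-8 bytes were misread as Latin-1 and re-encoded as UTF-8, twice. Converting from UTF-8 back to Latin-1 twice undoes this. Try this on some of your data and let me know if it works: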

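require 'iconv' # removed from the standard library in Ruby 2.0; needs the iconv gem there

# Undo two rounds of Latin-1/UTF-8 mis-encoding.
def decode(str)
  i = Iconv.new('LATIN1', 'UTF-8') # arguments are (to, from)
  i.iconv(i.iconv(str)).force_encoding('UTF-8')
end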

decode("MÃ\u0083Â¼LLER")
#=> "MüLLER"
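
On Ruby versions that no longer ship Iconv, the same round trip can probably be done with the built-in `String#encode`. This is a minimal sketch, not part of the original answer; the helper names `undo_latin1_round` and `decode_twice` are made up for illustration, and it assumes every character in the input fits in ISO-8859-1 (otherwise `encode` raises `Encoding::UndefinedConversionError`).

```ruby
# Hypothetical helper: undoes one round of "UTF-8 bytes misread as Latin-1".
def undo_latin1_round(str)
  # Map each character (U+0000..U+00FF) back to a single byte,
  # then relabel those bytes as UTF-8 without changing them.
  str.encode('ISO-8859-1').force_encoding('UTF-8')
end

# Hypothetical helper: applies the repair twice, as in the answer above.
def decode_twice(str)
  undo_latin1_round(undo_latin1_round(str))
end

puts decode_twice("MÃ\u0083Â¼LLER")
```

Calling `valid_encoding?` on the result is a cheap way to check whether the repair produced well-formed UTF-8 before trusting it.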

Patrick Oscity


+1 You beat me to it by about 5s. :-D See also this related question for python: http://stackoverflow.com/questions/4267019/double-decoding-unicode-in-python – Denis de Bernardy Jun 11 '13 at 11:35
Do you have any advice on the second string `"MA\u008EEIKIAI"` which produces: `Iconv::IllegalSequence: "\x8EEIKIAI"`? – Damien Roche Jun 11 '13 at 11:36
Yes it does. Not sure if that is a good or bad thing. Has been very punishing dealing with this data and I am utterly lost to the point of despair. Where do I even look for where to start with this mess? – Damien Roche Jun 11 '13 at 11:47
Found out the second string should be `Mažeikiai` by searching google without the codes. So, essentially, I need to convert `\u008E` to `Ž`. – Damien Roche Jun 11 '13 at 11:54
I've tried converting from every encoding in `Iconv.list` to `UTF-8` once and twice without results. – Patrick Oscity Jun 11 '13 at 12:09
Well, thank you very much for your help anyway. This `decode` method will come in handy. – Damien Roche Jun 11 '13 at 12:11
Posted new question http://stackoverflow.com/questions/17043840/convert-u008e-to-in-ruby. Hopefully something will turn up. Thanks again for your help. – Damien Roche Jun 11 '13 at 12:16
I think your data contains strings with different encodings. – Patrick Oscity Jun 11 '13 at 12:17
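
One possible explanation for the second string, offered as an assumption rather than something confirmed in the thread: byte `0x8E` is `Ž` in Windows-1252, so `\u008E` may be a Windows-1252 byte that was read as Latin-1. Under that assumption, a sketch of the repair could look like this (the helper name `recover_cp1252` is hypothetical):

```ruby
# Hypothetical helper: assumes stray C1 control characters (U+0080..U+009F)
# are really Windows-1252 bytes that were mislabeled as Latin-1.
def recover_cp1252(str)
  str.encode('ISO-8859-1')            # characters back to single bytes
     .force_encoding('Windows-1252')  # relabel the bytes, no conversion
     .encode('UTF-8')                 # transcode to real UTF-8
end

puts recover_cp1252("MA\u008EEIKIAI")
```

Bytes that Windows-1252 leaves undefined (such as `0x81`) would make the final `encode` raise, so this only fits strings damaged in this particular way. Data with mixed encodings would need a per-string check.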
