Convert Unicode characters in a Ruby string that is already encoded in UTF-8

Question

I have string values that are encoded in UTF-8, but they may also contain Unicode characters.

For example:

"\u0131".encoding
=> #<Encoding:UTF-8>

"\u0131" is "ı".

How can I convert all Unicode characters to UTF-8?

Thanks

Çağdaş

score 7 · Answer 1 · answered Feb 01 '13 at 08:53

Internally, all Unicode characters in this string are already represented as UTF-8 bytes. Let's check:

>> "\u0131".bytes.to_a
=> [196, 177]

OK, there are two bytes, but are they UTF-8 or UTF-16 bytes? The easiest way to check is to look at the binary representation. Let's iterate over the bytes and print each one in binary:

>> "\u0131".each_byte {|b| print b.to_s(2)};puts
1100010010110001
=> nil

This is the binary representation of your string. As you can see, it is the correct two-byte UTF-8 sequence for the character 100110001, that is, 0x0131:

110 00100 10 110001
---       --        ← UTF-8 markers for 2-byte char
    =====    ====== ← bits of your char

So the answer is: do nothing. The string is already UTF-8, Q.E.D.
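
The same check can be scripted. The following is only a sketch: `bit_string` is a helper name made up for this illustration, and `unpack1` assumes Ruby 2.4 or newer. Unlike `b.to_s(2)`, the `"B*"` directive zero-pads every byte to 8 digits, which makes no difference here because both bytes have their high bit set.

```ruby
# Return all bits of the string's bytes, each byte zero-padded to 8 digits.
# (bit_string is a hypothetical helper name used only for this example.)
def bit_string(str)
  str.unpack1("B*")
end

s = "\u0131"

puts s.encoding               # the encoding tag attached to the string
puts s.valid_encoding?        # true when the bytes are well-formed for that encoding
puts s.bytes.inspect          # the stored bytes as integers
puts bit_string(s)            # the same bytes as one bit string
puts format("U+%04X", s.ord)  # the code point decoded from those bytes

# For contrast, the UTF-16 (big-endian) form of the same character.
puts s.encode("UTF-16BE").bytes.inspect
```

If the string were UTF-16, the byte list would match the last line of output rather than `[196, 177]`.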

score 1 · Accepted Answer · answered Feb 01 '13 at 08:50

UTF-8 is an encoding for Unicode characters. You don't have to convert anything; your characters are already encoded in UTF-8. Whether they are displayed as \u0131 or ı depends on the displaying program.
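
As a rough illustration of how one string can be displayed in two ways, the sketch below prints the raw string and then an escaped view of it. Using `String#dump` as the "displaying program" is an assumption made for the example, and `undump` needs Ruby 2.5 or newer.

```ruby
# The same UTF-8 string, shown two ways.
s = "\u0131"

# Writing the string sends its raw bytes; a UTF-8 terminal draws them as ı.
puts s

# String#dump returns an ASCII-only, escaped copy wrapped in double quotes.
escaped = s.dump
puts escaped

# The underlying data is unchanged: undoing the escaping gives the original back.
puts escaped.undump == s
```

In both cases the string holds the same two bytes; only the presentation differs.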

Huluk


Thanks for the response. I'm using Rails, and in my views, strings that contain "ı", for example, are displayed as "\u0131". I thought it was an encoding problem, but maybe it is a view rendering problem. How can I solve this? – Çağdaş Feb 01 '13 at 08:57
I don't know much about Rails, but maybe some method is escaping your characters. You could try http://stackoverflow.com/questions/7015778/is-this-the-best-way-to-unescape-unicode-escape-sequences-in-ruby right before displaying the string. – Huluk Feb 01 '13 at 09:05
@Çağdaş, apart from trying to unescape your string, you could also try to find out why it becomes escaped and prevent it. Where does this string come from on its way to your view, and how is it output there? Also note the difference between Ruby string literals: `"\u0131"` and `'\u0131'` are not the same, and you should use *double* quotes so that \u is interpreted as an escape. – NIA Feb 01 '13 at 10:01
@NIA, the data comes from a graph database that stores RDF in N3 notation, and it only supports ASCII encoding for this notation. So I can't change it :) – Çağdaş Feb 01 '13 at 11:38
@Çağdaş, any solution to this? I'm having the same problem. My Rails view is just rendering the codes. – seenickcode Nov 20 '14 at 19:04
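
For data that arrives with literal `\uXXXX` text, as the ASCII-only N3 source seems to produce, one possible approach is to unescape it before rendering. This is a minimal sketch, not a complete N3 parser: `unescape_unicode` is a hypothetical helper name, it handles only four-digit `\uXXXX` sequences, and it does not treat an escaped backslash before `u` specially.

```ruby
# Hypothetical helper: replace each literal \uXXXX sequence with the
# UTF-8 character for that code point.
def unescape_unicode(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

# Single quotes keep the backslash literally, which is how escaped data
# read from an external source looks inside Ruby.
literal = 'k\u0131z'
puts literal.length                    # counts \, u and the four digits separately

# Double quotes make Ruby interpret \u0131 as one character.
puts "k\u0131z".length

# After unescaping, the literal form matches the double-quoted form.
puts unescape_unicode(literal) == "k\u0131z"
```

Calling such a helper once, where the data enters the application, is probably simpler than unescaping in every view.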
