
I am having a very difficult time with this:

# contained within:
"MA\u008EEIKIAI"

# should be
"MAŽEIKIAI"

# nature of string
$ p string3
"MA\u008EEIKIAI" 

$ puts string3
MAEIKIAI

$ string3.inspect
"\"MA\\u008EEIKIAI\""

$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes> 

Any ideas on where to start?

Note: this is not a duplicate of my previous question.

Damien Roche
  • Then it probably should be `"MAŽEIKIAI"` and not `"Mažeikiai"` – Patrick Oscity Jun 11 '13 at 12:18
  • I did say to ignore case @padde. – Damien Roche Jun 11 '13 at 12:18
  • Yes but what I posted is the string you expect, right? I think it's easier to post the exact expected result than adding a note to ignore the case. – Patrick Oscity Jun 11 '13 at 12:19
  • @Wooble not at all duplicate, and that question doesn't have any answers which even nearly begin to answer this question. – Damien Roche Jun 11 '13 at 12:21
  • @Wooble: wrong. `ruby -e 'puts "\u008E"'` prints nothing. – Patrick Oscity Jun 11 '13 at 12:21
  • @Wooble: actually it prints the ["single shift two" character](http://www.fileformat.info/info/unicode/char/8e/index.htm) but certainly not `Ž`. – Patrick Oscity Jun 11 '13 at 12:24
  • Indeed; comment deleted. I think the problem here is that the original string was generated from garbage, not a validly-encoded Ž character (which is `\u017d`) – Wooble Jun 11 '13 at 12:27
  • The problem I find, as in the link you provided @Wooble, is that under the Java section we can clearly see `Ž`. They are, in some way, closely related. What I cannot suss out, because of my inexperience with encodings, is how they are related, and how that site alone has derived `Ž` from `\u008e`? – Damien Roche Jun 11 '13 at 12:30
  • I think you are confused with `Kernel#p`, `Kernel#puts`, `String#to_s`(or String itself), `String#inspect`. You should read the documentation of these methods. Anyway, according to your code, the `string3` contains valid sequence `MAŽEIKIAI`, what you saw and got confused at was different representation of that string. – Arie Xiao Jun 11 '13 at 12:52
  • Could you add what `string3.bytes` shows? (I think I know, but just to be sure). – matt Jun 11 '13 at 12:57
  • Also, where does this string come from, and is this just an example of many similar problems you have or is it just the one string? – matt Jun 11 '13 at 13:16
  • @matt I've updated with `bytes` version of string. I've had problems with a few strings and found solutions, but this particular string has been causing many problems. Primarily because there doesn't seem to be a simple answer to convert the actual data `\u008e` to `Ž`. I mean, a regex would fix this problem, so I question whether such a library exists. – Damien Roche Jun 11 '13 at 13:39
  • @Zenph very interesting that the `Ž` character occurs on the page about the single shift two (SS2) character. But I do not see how a `Ž` could become SS2, only the other way round when using Java's `toUpperCase()`. – Patrick Oscity Jun 13 '13 at 08:13

2 Answers


\u008E means that the Unicode character with code point 8E (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at code point U+017D. However, it is at position 8E in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
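To make the mix-up concrete, here is a small sketch that inspects the string from the question. The variable names are only for illustration, and it assumes the string is UTF-8 encoded, as the `\u008E` in the `p` output suggests.

```ruby
bad = "MA\u008EEIKIAI"   # the string from the question (UTF-8)

# Code point of the third character: U+008E, a C1 control character.
bad_cp = bad.codepoints.to_a[2]

# In UTF-8 that code point is stored as the two bytes C2 8E.
bad_bytes = "\u008E".bytes.to_a

# The single byte 8E, interpreted as Windows-1252, decodes to U+017D (Ž).
from_cp1252 = [0x8E].pack('C').force_encoding('Windows-1252').encode('UTF-8')

puts format('U+%04X', bad_cp)                            # U+008E
puts bad_bytes.map { |b| format('%02X', b) }.join(' ')   # C2 8E
puts format('U+%04X', from_cp1252.ord)                   # U+017D
```

So the number 8E is right, but it has been treated as a Unicode code point rather than as a CP-1252 byte.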

The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.

Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:

string3.force_encoding('BINARY')  # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '')        # remove the C2 byte
string3.force_encoding('CP1252')  # give the string the correct encoding
string3 = string3.encode('UTF-8') # convert to the desired encoding (encode returns a new string)

Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx where xx is the correct byte (like in this case); others will be c3 yy where yy is a different byte.
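A different technique from the byte-stripping above may cover both the c2 xx and the c3 yy cases: transcode back to ISO-8859-1 to recover the original single bytes, then reinterpret them as Windows-1252. This is only a sketch. `repair_cp1252` is a hypothetical helper name, and it assumes the damage really was CP-1252 bytes read as Latin-1 and that the input is UTF-8.

```ruby
# Hypothetical helper (not from the answer above): undo
# "CP-1252 bytes were decoded as ISO-8859-1 and stored as UTF-8".
def repair_cp1252(str)
  # Step 1: map each character back to the single byte it came from.
  raw = str.encode('ISO-8859-1')
  # Step 2: reinterpret those bytes as Windows-1252, then convert to UTF-8.
  raw.force_encoding('Windows-1252').encode('UTF-8')
rescue EncodingError
  # The string has characters outside Latin-1, or bytes that are
  # undefined in Windows-1252; return it unchanged rather than guess.
  str
end

puts repair_cp1252("MA\u008EEIKIAI")   # prints MAŽEIKIAI on a UTF-8 terminal
```

Strings that are already correct and contain characters outside Latin-1 (a real Ž, for example) fail the first step and are returned untouched. Ordinary Latin-1 characters such as é round-trip unchanged. With 500k records it would still be worth spot-checking a sample before writing anything back.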

matt
  • I have 500k bad strings. What a nightmare! Thanks, I believe this will bring me closer to a final solution, but I have a feeling there will have to be compromises. – Damien Roche Jun 11 '13 at 13:42
  • @Zenph I suspect you may have had trouble with Ž in particular because it is in CP-1252, you likely had a lot of other characters that were in iso-8859-1 (latin 1). CP-1252 is a superset of 8859-1. – matt Jun 11 '13 at 13:44
  • That's correct, I believe the data was originally encoded in latin 1. – Damien Roche Jun 11 '13 at 13:46

What about using Regexp and Array#pack to convert the Unicode escape?

str = "MA\\u008EEIKIAI"
puts str    #=> MA\u008EEIKIAI

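str.gsub!(/\\u(\h{4})/) do |match|
  [$1.to_i(16)].pack('U')   # turn the four hex digits into that code point
end
puts str    #=> MA EIKIAI (U+008E, a control character, sits after "MA"; how it displays depends on the terminal)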
Arie Xiao
  • I was beginning to think `gsub` was the best option. Thanks. I'll try this out. – Damien Roche Jun 11 '13 at 12:31
  • The *only* problem with this is my string doesn't appear to have the unicode escaped. I can use `inspect` to escape, but that produces `MAEIKIAI`. The literal string I am working with is `MA\u008EEIKIAI`, not `MA\\u008EEIKIAI`. – Damien Roche Jun 11 '13 at 12:40
  • @Zenph then you don't need to escape the string at all. `String#inspect` is the 'source code' representation of the string; you can copy/paste it into your source code and Ruby will happily accept it. So the `inspect` version of a string will contain a leading and trailing `"`; however, the underlying string doesn't have those two characters. – Arie Xiao Jun 11 '13 at 12:44
  • I've updated my question to show the differences in the string you've presented in your answer, and the actual string I'm dealing with. – Damien Roche Jun 11 '13 at 12:48
  • @Zenph You don't need conversion at all. `string3` contains the String `MAŽEIKIAI` you need. `Kernel#p` shows how you can code that in your Ruby code. And `Kernel#puts` shows what exactly the String is. – Arie Xiao Jun 11 '13 at 12:57
  • @Zenph try `str = "a"; p str; puts str`, check the difference, (the `"` in `p`'s output). – Arie Xiao Jun 11 '13 at 12:59
  • That's great news, but then how do I display `MAŽEIKIAI`? Your `puts str` outputs correctly, mine does not. I'm totally confused. – Damien Roche Jun 11 '13 at 12:59
  • @Zenph that's an issue related to the terminal. You'll need a terminal that supports Unicode display. The shell under some Linux distribution/Mac OS X should work. – Arie Xiao Jun 11 '13 at 13:03
  • @Zenph it seems that I was wrong, `\u008E` is not unicode representation of `Ž`. – Arie Xiao Jun 11 '13 at 13:08