Ruby and encoding conversion

Question

I'm importing a CSV file into Ruby (1.8.7). File.open('path/to/file.csv').read returns this in the console:

Stefan,Engstr\232m

The encoding is identified as ISO-8859-2 by UniversalDetector (chardet gem).

UniversalDetector::chardet("Stefan,Engstr\232m")
=> {"confidence"=>0.626936305574385, "encoding"=>"ISO-8859-2"}

Trying to convert the string yields the following:

Iconv.conv("UTF-8", "ISO-8859-2", "Stefan,Engstr\232m")
 => "Stefan,Engstrm"

whereas I would expect:

 => "Stefan,Engström"

Could the string really be in some other encoding?
I haven't seen the \232 syntax before; usually when strings are strangely encoded, some weird character shows up instead, e.g. � or some Chinese characters.

Let me know if I should provide more information or elaborate on something.

It does not look like it's `ISO-8859-2`. It would be `\246` http://en.wikipedia.org/wiki/ISO_8859-2 — Kassym Dorsel, Dec 07 '11 at 19:24
@Kassym: It would be `\366` in ISO 8859-2, the `"\nnn"` notation uses octal. — mu is too short, Dec 07 '11 at 19:39

mu is too short · Accepted Answer · 2011-12-07T22:06:48.540

The encoding is probably "Macintosh Roman"; a couple of other options would be "Mac Central European" and "Mac Icelandic". The \nnn notation uses octal, so \232 is 154 in decimal (0x9A), and in all three of those encodings character 154 is the lowercase o-umlaut ("ö") that you're expecting. I don't see "ö" at position 154 in any of the Windows codepages or ISO 8859 character sets; in ISO 8859-2 that position falls in the C1 control range, which would explain why your converted string appears to have nothing between the "r" and the "m". I'd guess that Mac Roman is more common than the Icelandic or Central European encodings.
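To make the octal arithmetic concrete, here is a small sketch that looks at the single byte under a few candidate encodings. It assumes Ruby 2.0 or later (for `String#b` and `String#encode`) rather than the 1.8.7 from the question, and `char_for` is just a hypothetical helper name:

```ruby
# "\232" is an octal escape: 2*64 + 3*8 + 2 = 154 decimal = 0x9A.
byte  = "\232".b            # String#b returns a binary (ASCII-8BIT) copy
value = byte.bytes.first    # 154

# Hypothetical helper: interpret one byte under the named encoding and
# return the resulting character as UTF-8. Unmappable bytes are replaced
# instead of raising, so the loop below never aborts.
def char_for(byte, encoding_name)
  byte.dup.force_encoding(encoding_name)
      .encode("UTF-8", invalid: :replace, undef: :replace)
end

puts format("octal 232 = decimal %d = hex %02X", value, value)

%w[macRoman Windows-1252 ISO-8859-1].each do |name|
  ch = char_for(byte, name)
  puts format("%-13s U+%04X %s", name, ch.ord, ch.inspect)
end
```

Only the macRoman row should come out as U+00F6 ("ö"); in Windows-1252 the same byte is "š" (U+0161).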

Try using 'MacRoman' as your source encoding with Iconv:

>> Iconv.conv("UTF-8", "MacRoman", "Stefan,Engstr\232m")
=> "Stefan,Engström"
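Iconv was removed from the standard library in Ruby 2.0, so on a newer Ruby the same conversion would go through `String#encode` instead. A minimal sketch, assuming Ruby 2.0 or later and Ruby's spelling of the encoding name (`macRoman`):

```ruby
# Bytes as they come out of the file, with no encoding assumed yet.
raw = "Stefan,Engstr\232m".b

# encode(destination, source): treat the bytes as MacRoman, transcode to UTF-8.
utf8 = raw.encode("UTF-8", "macRoman")

puts utf8
```

When reading the file directly, something like `File.open('path/to/file.csv', 'r:macRoman:UTF-8') { |f| f.read }` should do the transcoding on the way in.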

edited Dec 07 '11 at 22:06

answered Dec 07 '11 at 19:51

THANKS! I tried on the larger dataset and it worked well too. Now I just have to figure out how to detect the charset, since `UniversalDetector::chardet` couldn't do it correctly. You seem to know this area very well -- any thoughts? – sandstrom Dec 08 '11 at 08:40
After reading some more, it seems to be tricky to distinguish MacRoman. http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and – sandstrom Dec 08 '11 at 08:54
@sandstrom: I don't know as much about encodings as tchrist (the author of the question you linked to). You'll notice that the three Mac encodings I listed overlap at ö, so it could have been any of them; I guessed MacRoman because it's more common than the others. This sort of thing usually boils down to a bunch of guesswork based on empty slots in the various character tables: you try a guess and change it if it breaks. dan04's answer is about as good as it gets. These days I do everything in UTF-8 and give people dirty looks if they try to use anything else :) – mu is too short Dec 08 '11 at 09:13
@sandstrom: The whole encoding issue is a nightmare and now you're starting to know why everything is moving towards UTF-8 for transport. – mu is too short Dec 08 '11 at 09:15
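The "try it and change your guess if it breaks" approach from the comments can be sketched roughly as below. This is only a heuristic, not what chardet or dan04's answer does: it assumes Ruby 2.0 or later, and `CANDIDATES`, `EXPECTED` and `guess_encoding` are hypothetical names. The allowlist of expected non-ASCII characters is an assumption about the data (Swedish names here) that you would have to adapt.

```ruby
# Candidate encodings, tried in order (assumption: adjust to your data sources).
CANDIDATES = %w[UTF-8 macRoman Windows-1252]

# Non-ASCII characters we expect to see in the data: å ä ö Å Ä Ö.
# Written as \u escapes so the snippet does not depend on the source encoding.
EXPECTED = "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6"

# Returns the first candidate under which the bytes decode cleanly and
# every non-ASCII character is one we expect, or nil if none fits.
def guess_encoding(raw, candidates = CANDIDATES, expected = EXPECTED)
  candidates.find do |name|
    text = raw.dup.force_encoding(name)
    next false unless text.valid_encoding?

    begin
      utf8 = text.encode("UTF-8")
    rescue EncodingError
      next false # byte has no mapping in this encoding, so the guess "broke"
    end

    utf8.each_char.all? { |ch| ch.ascii_only? || expected.include?(ch) }
  end
end

puts guess_encoding("Stefan,Engstr\232m".b).inspect
```

The order of `CANDIDATES` matters, since single-byte encodings will accept almost any input; UTF-8 goes first because random 8-bit text is rarely valid UTF-8.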
