2

Someone sent me an email containing letters like this:

IVIØR†€™

The correct text should be:

IVIØR†€™

How do I represent them in their original Portuguese language? The text got altered after being passed through an HTTP GET request.

I probably won't be able to fix the site, but maybe I could create a repair tool to fix these badly encoded letters. Does anyone know of such a tool, or how to do it manually? It seems like nothing is lost, just badly interpreted.

SSpoke

2 Answers

4

What happened here is that UTF-8 got misinterpreted as ISO-8859-1; and then other kinds of mangling (the bad ISO-8859-1 string being re-UTF-8-encoded; the non-breaking space character '\xA0' being converted to regular space '\x20') seem to have happened afterward, though those may just be a result of pasting it into Stack Overflow.

Due to the subsequent mangling, there's no really good way to completely undo it, but you can largely undo it by passing it through a not-very-strict UTF-8 interpreter. For example, if I save "IVIØR†€™" as a text-file on my computer, using Notepad, with the "ANSI" (single-byte) encoding, and then I open it in Firefox and tell it to interpret it as UTF-8 (Firefox > Web Developer > Character Encoding > Unicode (UTF-8)), then it displays "IVIØR� €™". (The "�" is because of the '\xA0' having been changed to '\x20', which broke the UTF-8 encoding.)
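The mechanism described above can be sketched in Python. This is a minimal illustration, not a general repair tool: it assumes exactly one round of UTF-8 bytes being misread as windows-1252 (what Notepad calls "ANSI" on Western European Windows), and the function names `mangle` and `repair_mojibake` and the sample strings are made up for the example.

```python
def mangle(text: str) -> str:
    """Simulate the bug: encode as UTF-8, then misread the bytes as windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Undo one round of mangling with a lenient UTF-8 decode.

    Characters that cannot be mapped back to a windows-1252 byte become "?",
    and byte sequences that are no longer valid UTF-8 become U+FFFD.
    """
    raw = text.encode("cp1252", errors="replace")
    return raw.decode("utf-8", errors="replace")


original = "Ø™"                      # illustrative sample, not the asker's string
broken = mangle(original)            # what the recipient ends up seeing
repaired = repair_mojibake(broken)   # round-trips when nothing else was damaged

# Extra damage of the kind mentioned above: a non-breaking space ('\xA0')
# turned into a regular space breaks the three-byte sequence for the dagger.
dagger_broken = mangle("†").replace("\xa0", " ")
dagger_repaired = repair_mojibake(dagger_broken)  # contains U+FFFD
```

When the only damage is the wrong decoding, `repaired` equals `original`. Once a byte has been altered afterwards, the lenient decoder can only mark the spot with the replacement character, which is the "�" seen in the Firefox experiment.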

ruakh
  • Wow, thank you. Any idea how I can fix the site to convert everything properly? `html_entities` or something? – SSpoke Oct 16 '11 at 04:03
  • Is this a website or an email? The website should send a header along with the data that specifies UTF-8 (if it doesn't, the default is latin1); essentially the same applies to an email. If this is a webmail site, the underlying software should convert the email to the same encoding as the webmail's page as part of rendering the page, and send the appropriate headers. That said, I've used webmail clients that blindly ignore character encodings: Emumail, in particular, used at my school, would corrupt every UTF-8 email. – Thanatos Oct 16 '11 at 22:58
  • @Thanatos can you help me out with this? `†Bakâ€` Any corruption in this one? The site has `` but it does nothing. – SSpoke Oct 18 '11 at 00:08
  • 2
    That one is hard to say. This is A GUESS: it certainly looks like UTF-8 data decoded as windows-1252 (latin1, while common, does not have a euro sign). `windows1252_encode('â€')` results in something that is 2 bytes of a 3-byte UTF-8 sequence. So we're missing the last byte, which may be showing as a space because it landed on an undefined octet or a control character. `”`, a 'smart quote', does just this and is common, so that could be ”Bak” (but with two closing smart quotes). `“` would look like `â€œ` if this were correct. – Thanatos Oct 18 '11 at 06:55
  • 1
    As for why: look at what headers the HTTP server is sending, specifically Content-Type. Content-Type will take precedence over `<meta>` tags, so if it is there, that is what gets used (though erroneously sending windows-1252 seems a bit hard to do). – Thanatos Oct 18 '11 at 06:56
  • The client responded by email; it's really `†Bak†`. Now another one came in: `°°LyLy°Â`. Is anything missing in this one? I have shut down the service temporarily until I test this out. I use Apache `httpd` on Red Hat. – SSpoke Oct 20 '11 at 01:29
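The guess in the comments about smart quotes can be checked byte by byte. The sketch below is only an illustration: the helper name `utf8_bytes_in_cp1252` is hypothetical, and it relies on Python's `cp1252` codec, which treats 0x9D as undefined.

```python
def utf8_bytes_in_cp1252(ch: str):
    """List each UTF-8 byte of ch with the character windows-1252 assigns to it.

    The second item is None when windows-1252 leaves that byte undefined.
    """
    result = []
    for b in ch.encode("utf-8"):
        try:
            result.append((b, bytes([b]).decode("cp1252")))
        except UnicodeDecodeError:
            result.append((b, None))
    return result


closing_quote = utf8_bytes_in_cp1252("\u201d")  # right double quotation mark
opening_quote = utf8_bytes_in_cp1252("\u201c")  # left double quotation mark
```

The closing quote encodes to E2 80 9D, and the last byte has no windows-1252 character, so only two visible characters survive. The opening quote encodes to E2 80 9C, whose last byte maps to "œ", so all three characters would remain visible.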
0

They're probably not broken. It's just a difference between the encoding they were sent in, vs. the decoding you're viewing them in.

Figure out what encoding was originally used, and use the same one to decode it, and it should look like the original. In terms of writing a "fix-it" tool, you'd always need to know what encoding they were originally created in, which can be complicated depending on the source, and whether or not you have access to said information.
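As a rough sketch of "use the same encoding to decode it", assuming you still have the raw bytes and not just the already-misdecoded text: try a short list of candidate encodings and compare the results by eye. The function name `candidate_decodings` and the candidate list are assumptions made for illustration.

```python
def candidate_decodings(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return {encoding: text} for every candidate that decodes raw without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


utf8_bytes = "Ø".encode("utf-8")       # two bytes: C3 98
decoded = candidate_decodings(utf8_bytes)

latin1_bytes = "Ø".encode("latin-1")   # one byte: D8
decoded_latin1 = candidate_decodings(latin1_bytes)
```

Strict UTF-8 decoding is a useful filter because it rejects most text written in a single-byte encoding, so "utf-8" is missing from `decoded_latin1`. The single-byte decoders accept almost anything, which is why the right encoding usually has to be chosen by a person or by knowledge of the source.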

jefflunt
  • Hmm, so it's the browser's fault on their side? The problem was solved by his secondary email, thank god; the correct one is `IVIØR†€™`. He said PayPal displays it wrong. How can I fix this? Should I URL-encode everything on the site? – SSpoke Oct 16 '11 at 03:17
  • The answer depends on the language/framework you're using, and it usually involves some research specific to that framework. Search around StackOverflow for "character encoding" + the framework(s) in question - you'll see some of the complexity involved, and once you figure out what's at the root of it in your case, there should also be some answers for you that can be more specific. – jefflunt Oct 16 '11 at 03:24
  • 2
    See the following StackOverflow question for information about character encoding detection; maybe that's your issue: http://stackoverflow.com/questions/774075/character-encoding-detection-algorithm – Jared Oberhaus Oct 16 '11 at 03:28