How to decode unexpected strings from users?

Question

I've published an app, and I find some of the comments to be like this: &ETH;&nbsp;&ETH;&micro;&ETH;&ordm;&ETH;&deg;&ETH;&frac14;&ETH;&micro;&ETH;&acute;&Ntilde;

I have searched extensively but cannot decode it into readable text. This is how it is stored in the database. It might be Cyrillic, but I could not decode it as that either. Any clue how to make sense of comments like this?

It was probably doubly HTML encoded. When decoding it twice using https://mothereff.in/html-entities the result is `Ð ÐµÐºÐ°Ð¼ÐµÐ´Ñ`. That *could* be botched Unicode data — Pekka, Feb 08 '16 at 15:19
and how come a simple user could write that from his mobile phone? — Filip Luchianenco, Feb 08 '16 at 15:20
This is not what users have typed in, this is how your input form has screwed it. You need to find and fix the bug, not 'decode' the accidental garbage. — hamstergene, Feb 08 '16 at 15:21
It could be that they are entering comments say in UTF-8, but in a non-western character set. Then probably a silly, misguided server-side "sanitation" routine garbles the data. — Pekka, Feb 08 '16 at 15:21
Mismatched text encoding somewhere in your pipeline -- the user didn't botch it up. — Daniel Beck, Feb 08 '16 at 15:22
the thing is that other comments are fine. what could this user type so that the output is this? — Filip Luchianenco, Feb 08 '16 at 15:22
`what could this user type so that the output is this?` هذا، على سبيل المثال — Pekka, Feb 08 '16 at 15:22
లేదా ఈ! ఏ ఆలోచన లాంగ్వేజ్గా ఉంది — Pekka, Feb 08 '16 at 15:23
Anything at all, in a locale or language that uses a different text encoding than you're set up for. (When in doubt, just UTF-8 everything; these days that'll get you 90% of the way there at least) — Daniel Beck, Feb 08 '16 at 15:24
thank you guys. The problem was with Russian language input. It would be great if any of you can post a formatted answer so that I can accept it. — Filip Luchianenco, Feb 08 '16 at 15:29

score 1 · Accepted Answer · edited May 23 '17 at 11:45

These appear to be doubly encoded HTML entities. So for example, & was turned into &amp; and that was then turned again into &amp;amp;
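As a small illustration of how double encoding arises, here is a Python sketch using the standard library's `html.escape` in place of whatever escaping routine the app actually runs (which is not known from the question):

```python
import html

# Escaping once is normal; escaping the already-escaped text again
# is what produces doubly encoded entities.
once = html.escape("&")
twice = html.escape(once)

print(once)   # &amp;
print(twice)  # &amp;amp;

# Undoing it takes two decoding passes.
print(html.unescape(html.unescape(twice)))  # &
```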

When decoding the data twice using an online tool such as https://mothereff.in/html-entities (there are many others), the result is:

Ð ÐµÐºÐ°Ð¼ÐµÐ´Ñ

That could be Unicode data, e.g. UTF-8 text in a non-Latin script such as Cyrillic or Arabic, that either

- was misinterpreted as single-byte input, or
- was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in versions before PHP 5.4, so a call to this function could be the whole source of the problem).
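To check this hypothesis against the stored comment, the damage can be simulated and reversed. The following is a minimal Python sketch, not a reproduction of the app's actual code: `garble` and `repair` are hypothetical helper names, and it assumes the bytes were misread as ISO-8859-1 (Latin-1) specifically.

```python
import html
from html.entities import codepoint2name


def garble(text: str) -> str:
    """Simulate the suspected bug: UTF-8 bytes misread as ISO-8859-1,
    then every character that has a named entity is entity-encoded."""
    misread = text.encode("utf-8").decode("latin-1")
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in misread
    )


def repair(stored: str) -> str:
    """Reverse the damage: decode entities (twice, in case of double
    encoding), then reinterpret the Latin-1 characters as UTF-8 bytes."""
    mojibake = html.unescape(html.unescape(stored))
    # Assumption: the misreading was Latin-1. encode() raises
    # UnicodeEncodeError if the text contains characters above U+00FF.
    # errors="ignore" drops a trailing byte of a truncated character.
    return mojibake.encode("latin-1").decode("utf-8", errors="ignore")


stored = (
    "&ETH;&nbsp;&ETH;&micro;&ETH;&ordm;&ETH;&deg;"
    "&ETH;&frac14;&ETH;&micro;&ETH;&acute;&Ntilde;"
)
print(repair(stored))  # Рекамед
```

The comment decodes to Cyrillic text that is cut off in the middle of a multi-byte character, which fits the Russian-input explanation. A repair like this is only useful for diagnosing or salvaging rows that are already stored; new input still needs the encoding bug fixed at its source.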

The fix will likely need to be made on the server side.

If you are using PHP, see "UTF-8 all the way through" for a handy guide.
